DOCUMENT RESUME

ED 236 156                                        TM 830 267

AUTHOR          Messick, Samuel; And Others
TITLE           National Assessment of Educational Progress
                Reconsidered: A New Design for a New Era.
INSTITUTION     National Assessment of Educational Progress,
                Princeton, NJ.
SPONS AGENCY    National Inst. of Education (ED), Washington, DC.
REPORT NO       NAEP-83-1
PUB DATE        Mar 83
CONTRACT        400-82-0018
NOTE            101p.
AVAILABLE FROM  National Assessment of Educational Progress, Box
                2923, Princeton, NJ 08541 ($5.00).
PUB TYPE        Reports - Descriptive (141)
EDRS PRICE      MF01/PC05 Plus Postage.
DESCRIPTORS     Data Analysis; Data Collection; *Educational
                Assessment; Elementary Secondary Education; *Federal
                Programs; Latent Trait Theory; Methods; Policy
                Formation; Program Content; *Program Descriptions;
                *Program Design; Research Needs; Sampling
IDENTIFIERS     Balanced Incomplete Block Spiralling; *National
                Assessment of Educational Progress



ABSTRACT 

This report presents the conceptual framework and
major features of the new design for the National Assessment of
Educational Progress (NAEP) as conducted by Educational Testing
Service beginning July 1983. It comprises three major chapters. The
first chapter reviews the social and environmental changes that
demand reconsideration of NAEP. The new design was formulated to
address concerns focusing on performance standards, school
effectiveness questions, and broad human resource issues, thereby
improving NAEP's relevance to educational policy and practice. The
second chapter discusses technical innovations now possible with
proven modern techniques that greatly enhance the power and value of
the collected data. Sampling by grade as well as by age permits
estimates of performance and trends to be reported by both age and
grade, thereby allowing direct links to state and local assessments,
school practices, and educational policies. The third chapter
illustrates ways the new design addresses multiple policy questions,
communication with multiple audiences in an effective fashion,
linkages to other data sources, enhancement and extension of NAEP
services, and engagement of the public on the important educational
issue of performance standards. Primary type of information provided
by the report: Program Description (Operating Policies); Procedures
(Conceptual). (PN)
Preface



This report presents the conceptual framework and major features
of the new design for the National Assessment of Educational
Progress (NAEP) as conducted by Educational Testing Service
(ETS) beginning July 1983.

The new design is comprehensive in that it entails procedural
changes in sampling, objectives setting, exercise development,
data collection, analysis, dissemination, and user services. It is
inclusive in that the Assessment is extended to previously excluded
or inadequately represented populations, in particular to
functionally-handicapped and limited-English speaking students
as well as to out-of-school 17-year olds and adults. It is innovative
in that modern psychometric methodology is applied to move the
Assessment beyond the level of discrete exercises or arbitrary
exercise composites to the level of measurement of performance
dimensions. It is protective of continuity in that statistical links
are forged to past methods and data to maintain and enhance the
examination of trends. It is practitioner-oriented in that performance
data are systematically tied to background and program
variables relevant to educational policy and practice. And it is
aggressive in its involvement of user groups, educational
constituencies, societal stakeholders, and the general public to amplify
NAEP's impact not only on the conduct of education but on the
pluralistic standards and goals of education.

The report comprises three major chapters covering in turn the
reasons for the new design, the nature and power of the new
design, and the implications and payoff of the new design. The
first chapter reviews the strengths and weaknesses of the original
assessment design and its responsiveness to the political realities
of its time. When social and environmental changes that demand
reconsideration of NAEP today are examined, it becomes clear that
current national concerns focus on performance standards, school
effectiveness questions, and broad human resource issues. The
new design was formulated to address these concerns using
National Assessment data, thereby improving NAEP's relevance to
educational policy and practice.

The second chapter discusses technical innovations now possible
with proven modern techniques that greatly enhance the
power and value of the data collected. Through the use of a balanced
incomplete block (BIB) spiralling variant of matrix sampling,
exciting new analyses are feasible because the data are no
longer booklet-bound. Covariances may now be computed among
all exercises in a subject area, so that

• composites of exercises can be appraised empirically for coherence
  and construct validity;

• the dimensional structure of each subject area can be determined
  analytically as reflected in student performance consistencies;

• item response theory (IRT) scaling can be applied to unidimensional
  sets of exercises regardless of what booklet they appear
  in;

• IRT scales can be developed having common meaning across
  exercises, population subgroups, age levels, and time periods;

• more powerful trend analyses can be undertaken by means of
  these common scales;

• performance scales can be correlated with background, attitudinal,
  and program variables to address a rich variety of educational
  and policy issues; and,

• public use data tapes can be made much more useful because
  secondary analyses are also no longer booklet-bound.
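
To make concrete why spiralled booklets are no longer "booklet-bound," the following is a minimal sketch and not the actual NAEP booklet plan. It assumes a hypothetical subject area divided into seven exercise blocks, with three blocks per booklet, and builds a small balanced incomplete block design by cyclically shifting the base set (0, 1, 3). The block count, booklet size, and function names are illustrative assumptions. The check confirms that every pair of blocks shares a booklet exactly once, which is the property that allows covariances to be estimated among all exercises even though no student takes them all.

```python
from collections import Counter
from itertools import combinations


def cyclic_bib(n_blocks=7, base=(0, 1, 3)):
    """Build booklets by shifting a base set of block indices modulo n_blocks.

    With n_blocks=7 and base=(0, 1, 3) this yields a classic small
    balanced incomplete block design (illustrative parameters only).
    """
    return [
        tuple(sorted((b + shift) % n_blocks for b in base))
        for shift in range(n_blocks)
    ]


def pair_counts(booklets):
    """Count how many booklets contain each pair of exercise blocks."""
    counts = Counter()
    for booklet in booklets:
        for pair in combinations(booklet, 2):
            counts[pair] += 1
    return counts


booklets = cyclic_bib()
counts = pair_counts(booklets)

# Every possible pair of the seven blocks should co-occur in some booklet.
all_pairs = list(combinations(range(7), 2))
balanced = all(counts[pair] == 1 for pair in all_pairs)

for number, booklet in enumerate(booklets, start=1):
    print(f"booklet {number}: blocks {booklet}")
print("every pair of blocks appears together exactly once:", balanced)
```

Under a traditional matrix sample in which each booklet holds a fixed, disjoint set of exercises, pairs of exercises in different booklets are never answered by the same student, so their covariance cannot be estimated directly. The balanced design removes that restriction.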

In addition, groups previously excluded from the Assessment
(the limited-English speaking and functionally handicapped) are
studied more intensively. Sampling is refined to provide better
representation of Hispanic students in terms of their major cultural
subgroups (Puerto Rican, Cuban, and Mexican American)
and to permit systematic reporting of Hispanic results separately.
Sampling by grade as well as by age permits estimates of performance
and trends to be reported by both age and grade, thereby
allowing direct links to state and local assessments, school practices,
and educational policies, which are all typically grade-based.
Samples of adults and out-of-school 17-year olds are
reintroduced into the Assessment by cost-effective means that
also link the exercise performance levels of these groups to
labor-force participation data and employment trends.

The third chapter illustrates the ways in which the new design
facilitates the addressing of multiple policy questions, communication
with multiple audiences in effective fashion, linkages to
other data sources, enhancement and extension of Assessment
services, and engagement of the public on the important educational
issue of performance standards.

The preparation of this report was partially supported by the
National Institute of Education (NIE) under contract No. 400-82-0018,
which called for the development of "cost-efficient, imaginative
alternative designs to conduct a National Assessment of
Educational Progress." It was also included as the lead section on
Proposed Design in the successful ETS proposal for "The Conduct
of the National Assessment of Educational Progress," in response
to NIE Grant Announcement No. PA-82-0001. Now, in order to
make the rationale and plans for the redesigned NAEP widely available
to a variety of interested publics, the report has become the
first release in the new series of NAEP publications under the ETS
grant.

Samuel Messick
Princeton, New Jersey
March, 1983
I.

The Original Assessment Design 
and Changing Assessment Needs 



The original design of the National Assessment of Educational
Progress (NAEP) was brilliantly responsive to the political constraints
of the time. Established in the 1960s to assess the condition
and progress of education in the country, the original NAEP
design attempted to take due account of the existing political and
social realities that were likely to jeopardize its successful
implementation. Prominent among these concerns was the recognition
that an expanded federal role in education, coming at a time of
limited state capacity, represented a serious threat to state and
local education agencies. Of prime importance was the feeling
that the sanctity of local control of education might be perceived
to be undermined by a nationally imposed assessment effort if it
conveyed overtones of national curriculum and national testing.



The Politics of Assessment and Its Legacies 

In light of such concerns, the original NAEP architects developed a
sampling plan insuring that accurate results could not readily be
reported at the state or district level. They espoused matrix sampling
procedures insuring that no individual would take more
than a small sample of diverse exercises or items, so there would
be no tests or test scores in the traditional sense and certainly no
test scores for any individuals. They capitalized on the strengths
of matrix sampling to insure comprehensive coverage in depth
within subject matter and in breadth across subject matters,
thereby generating sets of objectives and exercises that reflected
salient features of most extant curricula but were too extensive to
be incorporated in practice in any single curriculum, national or
otherwise. They insisted on analysis and reporting at the exercise
level, so that the focus would be not on curriculum units or
knowledge and skill domains, but on specific learning outcomes
whose nature and importance could be directly judged by laymen
and professionals alike. As a final example, the assessment was
organized in terms of age levels rather than grade levels, which,
while having a number of important points in its favor, has the
consequence of severing NAEP results from the major way in
which schools are organized, state and local assessments are
reported, and educational policies are formulated. Thus, since
the original NAEP design by deliberate plan made it difficult if not
impossible to link assessment results to state or district programs
or to grade-related practices in the schools, educators were less
threatened and political feasibility was assured. However, the
very design features that were advantageous from a political
standpoint also carried the heavy cost of attenuating the usefulness
of the assessment results for affecting educational practice.



The Problem of Defensible Interpretations

The main problem with the original assessment design is one of
meaning and interpretability of the findings. The intended benefits
of exercise-level reporting were simply not realized: namely,
that the specific learning outcome embodied in a discrete exercise
readily conveyed its own criterion-referenced standard and that a
direct link could be easily perceived between the exercise and the
educational objective it represented. On the one hand, discrete
exercises may often be interpreted to reflect multiple objectives
and, on the other hand, it is a rare educational objective of any
importance that can be fully captured in a single instance of
behavior. Rather, educational objectives refer to consistencies in
student performance that cut across classes of behavior (Cronbach,
1971).

This limitation of strict exercise-level reporting of percent correct
on each exercise was eventually addressed by NAEP by also
reporting average percent correct on aggregations of exercises
presumed to reflect the same dimension or objective. But these
aggregations were determined on the basis of educators' judgments
and may or may not be supported empirically in terms of student
performance consistencies on the exercises judgmentally aggregated.
What is needed is not only a means of justifying judgmental
exercise aggregations in terms of student performance consistencies,
but of empirically determining the aggregations of exercises
that best reflect existing performance consistencies of educational
import. In either case, since the aggregations are interpreted
in terms of performance constructs (such as reading comprehension
and computational skill), evidence must be accrued for
their construct validity and for linking them to educational objectives
or sets of objectives as well as to domains of knowledge and
skill within subject-matter areas.

The critical requirement for establishing interpretable and defensible
aggregations of exercises is to develop a capability for
estimating correlations or covariances among exercises as well as
between dimensions of exercises and other variables. This capability
would permit an empirical evaluation of the coherence
and construct validity of the judgmental or nominal exercise categories
interpreted in past assessments as "reading comprehension,"
"science knowledge," and so forth. More importantly, it
would permit an evaluation of empirically-grounded exercise
categories at different levels of generality, including the possibility
of higher-order skills that might cut across content or
subject-matter domains. For example, one could appraise the empirical
viability, not only of exercise categories tightly tied to the
behavioral language of task performance, such as "adding two-digit
numbers," but also of performance constructs of increasing
generality, such as computational accuracy, number facility, and
higher-order skills of quantitative reasoning and problem solving.
It would also be possible to assess the extent to which higher-order
skills such as problem solving and critical judgment cut
across subject-matter areas.

By analyzing and reporting assessment results only in terms of
specific exercises and unverified judgmental or nominal exercise
categories, the relation of trends to more useful indices of achievement
is obscured. But by analyzing and reporting empirically-grounded
performance consistencies that are interpretable in
terms of educationally meaningful dimensions of knowledge and
skill and that can be related to other variables of background,
attitude, school, and program, the practical and policy implications
of the results may be more directly addressed.



The Problem of Comparability

To realize these benefits, however, we need some means of assuring
comparability of meaning of performance across exercises
within performance dimensions and, of prime importance, comparability
across different time periods. Since many factors can
affect percent success on a given exercise, the measurement of
change in terms of single exercises is inherently difficult to interpret.
Nor do differences in average percent correct across sets of
exercises provide satisfactory indices for assessing change. A key
problem is that the relationships between percentages and quantitative
variables such as those descriptive of background or program
characteristics are typically nonlinear, so interpretations of
the meaning and sources of percentage change are often either
misleading or abstruse. This difficulty may be overcome, however,
by employing a scaling model such as Item Response Theory
(Lord, 1980a) that transforms percent correct to a logit scale
(log [p/(1 - p)], where p is the proportion correct) to define
latent continua which are typically linearly
related to other quantitative variables.
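
The nonlinearity of percentages can be illustrated with a small sketch. It is not the scaling procedure used by NAEP; it only applies the standard logit transformation to proportions correct and shows that the same shift on the logit scale corresponds to quite different percentage-point changes depending on the starting level. The function names and the shift of 0.5 logits are illustrative assumptions.

```python
import math


def logit(p):
    """Map a proportion correct (0 < p < 1) onto the logit scale."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a logit-scale value back to a proportion correct."""
    return 1.0 / (1.0 + math.exp(-x))


def percent_gain(p_start, shift):
    """Percentage-point change produced by a fixed shift on the logit scale."""
    return 100.0 * (inv_logit(logit(p_start) + shift) - p_start)


# The same hypothetical improvement of 0.5 logits, applied to an
# exercise of middling difficulty and to an easy one.
gain_middle = percent_gain(0.50, 0.5)
gain_easy = percent_gain(0.90, 0.5)

print(f"gain from 50% correct: {gain_middle:.1f} points")
print(f"gain from 90% correct: {gain_easy:.1f} points")
```

An identical underlying change therefore looks large on a middling exercise and small on an easy one when expressed in percent correct, which is one reason raw percentage changes are hard to compare across exercises.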

An important outcome of this item response theory (IRT) scaling
is that exercises are characterized by invariant scale parameters
that are directly comparable across exercises on the same
latent dimension, whether at the same or different points in time.
This enormously simplifies the measurement and interpretation
of changes and trends over time. However, to protect and maintain
the capability for trend analysis over past as well as future
data, the procedural changes entailed in covariance estimation
and IRT scaling should be introduced in a way that forges technically
viable links to past data.
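
What invariance of exercise parameters means in practice can be sketched under one simplifying assumption: a one-parameter logistic (Rasch-type) model, which is one member of the IRT family and is not necessarily the model adopted for NAEP. The difficulty values and ability levels below are hypothetical. The sketch shows that the gap between two exercises is the same on the logit scale for every ability level, while the gap in percent correct varies from group to group.

```python
import math


def p_correct(theta, b):
    """Chance of success for ability theta on an exercise of difficulty b
    under an assumed one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def logit(p):
    """Map a proportion correct (0 < p < 1) onto the logit scale."""
    return math.log(p / (1.0 - p))


# Hypothetical difficulty parameters for two exercises on one dimension.
B_EASY, B_HARD = -0.5, 1.0


def logit_gap(theta):
    """Difference between the two exercises on the logit scale."""
    return logit(p_correct(theta, B_EASY)) - logit(p_correct(theta, B_HARD))


def percent_gap(theta):
    """Difference between the two exercises in proportion correct."""
    return p_correct(theta, B_EASY) - p_correct(theta, B_HARD)


for theta in (-1.0, 0.0, 2.0):  # hypothetical ability levels
    print(
        f"ability {theta:+.1f}: "
        f"logit gap {logit_gap(theta):.3f}, "
        f"percent gap {100 * percent_gap(theta):.1f} points"
    )
```

Because the logit gap depends only on the exercise parameters, exercises calibrated in one assessment year can in principle be compared with those calibrated in another, provided the model fits and the scales are properly linked.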

Although these and other design features are recommended and
examined in detail in the body of this report, we are concerned
not only with improving the meaning and interpretability of the
assessment results but also with enhancing their utilization in
affecting educational policy and practice. As a consequence, we
will address not only the redesign of data collection, analysis, and
reporting procedures but also the redesign of other NAEP activities
and functions bearing on objectives setting, dissemination, and
knowledge utilization.

Before presenting our recommendations for redesign, however,
we will first address the reasons why we think such innovations
are feasible in the present political and social context by examining
major changes that have occurred in this regard since the
1960s. Then, to insure that our redesigned NAEP will be responsive
to current policy issues and flexible enough to respond to
changing policy issues, we will next assay the major classes of
policy questions that dominate the current educational scene as
well as those looming large on the horizon. We are particularly
concerned about those kinds of policy questions that NAEP should
be in a position to address but cannot be effectively handled in its
present mode of implementation. Next comes the main section of
the report, which presents the recommendations for redesign in
detail and provides the rationale for resolving the major design
issues.

Finally, the closing section of the report reviews how the new
design improves the meaning and interpretability of assessment
results and trends, illustrates its capability for timely response to
current and new policy questions and its flexibility for addressing
a variety of such questions, and recommends ways of enhancing
NAEP's educational impact. The stress in connection with this latter
point is on the development of linkages, primarily between
NAEP exercises and those used in large or longitudinal research
data bases, in statewide assessments, and in commercially published
educational tests widely employed in both state and local
assessments. By these means the results of research, state, and
local studies may be viewed in national perspective and the quality
and comparability of assessment at all levels thereby enhanced.
Other linkages to be developed are those between the
objective setting and standard setting processes and their attendant
connections to exercise specifications, performance outcomes,
and progress toward the attainment of standards.



Factors Shaping NAEP in the 1980s 

The context of education policymaking in the 1980s is significantly
different from that of the late 1960s when NAEP was
initiated. This section examines the current environment and
discusses the policy issues NAEP should be able to address. Of
particular importance in understanding the issues and factors
presently shaping NAEP are (1) the changed federal role, (2) an
increased state capacity for problem solving, (3) an erosion of
educational credibility, and (4) the reduction of financial resources.
Taken together, along with growing and pervasive pressures for
educational accountability, these forces create new demands that
must be accommodated if NAEP is to be a useful policy tool in the
future.

The Changed Federal Role

Prior to the 1960s the federal government's involvement in education
was modest, confined almost exclusively to assisting
states with activities they had already adopted. When NAEP was
developed, however, the legacy of President Johnson's "Great
Society" was in full sway and the federal role had undergone a
significant and fundamental change from that of assisting state or
local governments to accomplish their own objectives to that of
using federal money to accomplish a national purpose (Sundquist &
Davis, 1969).

The Elementary and Secondary Education Act of 1965 (ESEA),
with its emphasis on disadvantaged children and a focus on building
state capacity, was a dramatic and ambitious effort to enlist
local and state education agencies in meeting national objectives.
Moreover, it served as the centerpiece for a continuing series of
measures to extend federal concern to other previously excluded
groups: migrants, native Americans, the limited-English speaking,
and the handicapped. This new activist thrust of the federal
government was the result of two critical assumptions concerning
state and local education agencies (SEAs and LEAs): first, that
they either did not know how, or did not fully accept the
responsibility, to adequately teach disadvantaged children; and second,
that an infusion of knowledge and federal resources could improve
the quality of elementary and secondary education.

This expanded federal role represented a threat to many state
and local officials in that it not only changed the traditional
stance of the federal government in education, but in some instances
it conflicted with state and local practices. Distrust was
great in both camps: federal officials often felt state and local
education personnel were not interested in, or capable of, dealing
with federal concerns; state and local administrators feared the
imposition of federal regulations and a national curriculum on
what had been their time-honored bailiwick of "local control."
Passage of the Civil Rights Act of 1964 and subsequent enforcement
of school desegregation guidelines under Title VI stoked
these fears of federal encroachment.

It was in this environment of tension and distrust that NAEP was
designed and implemented. Originally the central question before
the developers of NAEP was how to collect representative national
data on educational competence while assuring state and local
administrators that no federal standard would be imposed nor
invidious comparisons made among states or districts. NAEP was
merely to be a barometer of the nation as a whole. Usefulness to
state and local officials was not a primary consideration.

The 1980s represent a different political environment. The concept
of a "New Federalism," with its emphasis on state and local
capability for problem solving, hopes to capitalize on the achievements
of the past fifteen years of activist federal involvement
while attempting to deal with any problems such a federal role
created. Not surprisingly, the fifteen year record of federal activism
produced both positive and negative effects. Most positive
was the adoption of many national objectives and the upgrading
of state capacity. Aid for compensatory education is now a feature
of 24 state-aid laws (Silverstein et al., 1977). Bilingual education
and education for the handicapped have also seen parallel development,
with the states in some cases taking the lead and the
federal government left to imitate (Wilken & Porter, 1977;
Moore, Walker, & Holland, 1982). The negative element of past
federal policy on the one hand is a growth in paperwork burden
and, on the other hand, the development of statutes and guidelines
which, when imposed on the diverse state political cultures,
sometimes have impeded rather than enhanced national objectives
(Hill & Kimbrough, 1981).

States today may be no less afraid of national standards and
curricula, nor should they be, but they appear to be much more open
to the use of national comparative information about educational
achievement that could help them set their own standards.
Although on occasion there have been isolated calls for a "national
standard" (for example, by Admiral Rickover during the
1978 hearings on reauthorization of ESEA), such proposals increasingly
are viewed as "straw men" and have consistently been opposed
by federal education officials on the grounds that setting
standards is clearly a state responsibility. The central question
now before the directors of NAEP is how to conduct a national
assessment that will be directly relevant to state and local
policymakers as well as serve as a creditable national indicator of
educational competence for the general public.

State Capacity for Problem Solving

When NAEP was being planned, there was a prevalent stereotype of
the "backward SEA." In his 1965 testimony urging support for
Title V of ESEA, Commissioner of Education Francis Keppel detailed
the weaknesses of state departments, pointing to their lack
of staff, inability to monitor and coordinate programs, and general
absence of planning activities (Bailey & Mosher, 1968). Since
that time there has been considerable upgrading of state department
personnel and functions (Murphy, 1973). Virtually all federal
elementary and secondary education legislation contains
funds for some state department activities, from monitoring and
evaluating programs, planning needs assessments and coordinating
staff development to increasing equity in school finance formulas.
As McDonnell and McLaughlin (1982) point out: "Even
those agencies with the fewest resources are able to do more than
they could fifteen years ago, and most SEAs are capable of providing
significantly more services to local districts." This increased
capability is not merely the result of the infusion of ESEA Title V
dollars and other federal monies, but also results from state
responses to the public cries for accountability and for demands
that the educational system "do something" in the wake of bad
publicity regarding student performance (McLaughlin, 1981).

Today, state departments of education play a major role in local
school improvement efforts (Odden & Dougherty, 1982). They
need a wide variety of information on school effectiveness and
the relationship of achievement to such factors as school
organization, staff training, competency requirements and the like.
NAEP should be able to contribute relevant data and analyses to
help meet these widespread information needs.

Educational Credibility

When NAEP began, there was some concern about how well the
states were serving particular groups, such as the poor and racial
or ethnic minorities, as well as serving particular national manpower
needs (we were just recovering from the Sputnik shock).
But overall there was a belief that the nation's public schools
were sturdy, productive institutions. In fact, it was the confidence
in schools and their mission that caused the planners of the
Great Society to enlist education as the principal soldier in the
War Against Poverty (Gardner Presidential Task Force of 1964).

In the 1960s, indeed even into the early 1970s, as the Gallup
Annual Education Polls indicate, Americans generally felt their
schools were doing a good job (Phi Delta Kappan, 1978). The
majority gave their local schools good grades and believed schools
were better than when they themselves had attended. Today the
confidence is severely eroded, however, and the majority no
longer believes schools are as effective as they had been in the
past.

Several factors, some common to institutions in general and
others specifically related to education, have contributed to this
credibility gap. The disillusionment in the late 1960s and early
1970s with America's involvement in Vietnam coupled with the
Watergate revelations of the Nixon Administration served to
undermine confidence in many of our traditional institutions,
from the Presidency to the military to business to education. But
other developments (the SAT-score decline, violence and vandalism
in the schools, and accounts of illiterate high school graduates)
created new demands for accountability. Consumers of
the "products" of the education system began to sound the
alarm.

The College Board (1977) announced the creation of a Blue Ribbon
Panel to investigate the SAT score decline; the Senate held
hearings to determine the extent and effect of violence and vandalism
in the schools (Bayh, 1977); Pentagon officials argued in
Congressional testimony against a volunteer army, citing the
lack of preparation of high school youth; businessmen complained
about the need to train workers to compensate for the
inadequate basic skills of high school graduates; and finally, even
students themselves have brought a few malpractice suits against
the system for failing to educate them (Baratz & Hartle, 1978).
Tales of the educational insufficiencies of young people are
commonplace in the media and the cries for relevance, so prevalent in
the 1960s and early 1970s, have been replaced by demands for
rigor (Fiske, 1981).

One result of the concern about quality was the call for standards.
In the early 1970s some states had initiated statewide
assessments to monitor general education achievement within
their states. In the mid-1970s, with the hue and cry over poor
performance of graduates, grade inflation, and social promotion,
many states began imposing minimum competency standards on
students (and in the late 1970s some states began competency
testing for teachers). Within a few years, over two-thirds of the
states had minimum competency requirements and virtually
every state now has a statewide assessment or minimum competency
testing program (Baratz, 1980). "Seat diplomas" were
replaced by specific course requirements and demonstrated
competencies for graduation. In the 1980s NAEP should not only assist
education agencies to assure high quality assessment programs,
but should also facilitate the linking of information now available
at the state and local levels with NAEP data. By this means, questions
concerning school practices, curricula, progress of particular
student groups and the like could be more fully addressed, and
assessment results would be more useful to education administrators,
classroom teachers, and the taxpaying public.

Fiscal Pressure 

NAEP was conceived in the "salad days" of the [...]: the
economy was expanding, enrollments were growing, schools enjoyed
the full support of their communities, [...]
were increasing. Today the situation is dramatically different.

Since the mid-1970s, there has been a [...] the
defeat of local school budgets. Even more significant has been the
"Proposition 13" phenomenon of the late 1970s [...]ting
taxes and expenditures that severely curtail money available
for schools. In addition, along with [...],
there is also a noticeable but modest drift
toward private education ([...] Smith, 1982) and lively debate
over vouchers, tax credits, and other incentives to support
private schooling. The state purse almost everywhere is in "ill
health" when compared to a decade ago (Shulins, 1982). Tax
revenues are not keeping up with inflation. As Adams (1982)
observed, four factors are generally responsible for this deteriorating
condition: "(1) significant efforts by states to reduce tax burdens,
(2) changes in federal individual and corporate income tax
structure, (3) a severe recession beginning in 1981, and (4) major
cutbacks in federal aid to states and localities."

Demographic changes (declining enrollments, shifts from the
cities, increasing numbers of older citizens) have also affected
state funding for education. Educators, now more than ever
before, find themselves competing with other interests for their
share of the public purse. When political competition is coupled
with tight dollars in state and local governments, the pressure on
educators increases. Meeting the expanding responsibilities of
the education system and providing quality education with
declining real resources is the major challenge facing state and
local education agencies in the 1980s. NAEP should provide
information to state and local officials that is relevant to the
effectiveness of various school improvement strategies, information that
is not only useful in planning but also addresses state-specific
needs.

For all of these reasons, we feel that innovations to improve the 
interpretability, policy relevance, and utility of naep are not only 
feasible in the current political and social climate, but just about 
mandatory. 

Policy Issues NAEP 
Should be Able to Address 

It seems clear that NAEP must now serve a wide audience with
diverse needs. Criticism of NAEP in the past has underscored its
failure to be responsive to policy needs (Wirtz & Lapointe, 1982;
Milrod, 1980; Wiley, 1981; Sebring & Boruch, 1982). What are
some of the issues that NAEP should focus on as it reorganizes to
meet the challenges of the eighties?

Among the variety of pressing issues, three general policy areas
stand out which should be addressed by NAEP because they require
reliable data on student competencies and achievement: student
competencies as they relate to national concerns; student
achievement and attitudes as they relate to human resource needs; and
student achievement as it relates to school effectiveness. In
addressing these issues NAEP must not only be able to provide a
national overview, but must also be relevant to state and local
concerns, not for the purpose of needless comparisons among
states or school districts but to assist individual states and
localities in meeting their goals and objectives.


National Concerns 

Since NAEP's inception, the federal government has designed and
implemented education policies to provide equal educational
opportunity to all citizens and to assure that young adults would be
able to contribute to society in terms of both productivity and
participation in the democratic process. The government clearly
understands that an educated populace is a fundamental
requirement for the nation's political and economic well-being. A major
responsibility of NAEP should be to provide information for
governmental and educational policymakers on the effects of
their efforts and to act as an "early warning system" of potential
problems.

At a minimum, NAEP data should be relevant to the following
kinds of questions:

Are today's students learning the skills necessary for productive
functioning in America in the 1980s? The 1990s? The year
2000?

Are today's youth developing the flexibility to reorganize their
skills in response to occupational and societal change?

Are students in urban, suburban, and rural schools all being
adequately prepared?

Are public and private school children equally well prepared?

Do children have access to programs preparing them to deal
with the computer age?

Are minority and disadvantaged youngsters being so prepared?

Do minority and disadvantaged students in desegregated learning
environments perform better than those educated in segregated
settings?

What types of programs or allocations of resources seem to
make a difference for disadvantaged and minority students?

Are children from limited-English speaking homes being
provided the necessary skills?

Do students who have received special services under federal or
state programs perform better than similar children who have
not had access to those programs?

Are students developing cultural commitment and appreciation,
whether in arts and humanities or in science and technology,
or both?

Do students leave formal education with a positive attitude
toward continued learning so essential in our rapidly changing
environment?

Do students leave formal education with positive attitudes
toward productive work?




Human Resource Issues 


The federal government is concerned with the flow of human 
resources to assure a work force competent to function in an ad- 
vanced technology society and the necessary military personnel 
to protect American interests. Planning for human resource 
deployment is a complex process that requires reliable informa- 
tion on young people's competencies, training, and attitudes. 

In the past we have vacillated between feast and famine in 
critical personnel areas. In the late 1950s, with Sputnik's launch- 
ing, we were acutely aware of our need to develop more scientists 
and engineers. By the late 1960s, however, the market was glut- 
ted and engineers and physicists were seeking new careers. 
Today, once again we find ourselves undersupplied in the science 
and technology fields, with dim prospects for the future if stu- 
dents do not have a chance to be trained in science and to learn 
about career opportunities. NAEP should assist governmental and
educational policy planners by contributing information on the 
following kinds of questions: 


What are the competencies of students in math and science and 
what are their attitudes toward these fields? 

What kinds of training do students receive? 

What are the career goals of high school students?

What are the attitudes of today's youth toward the military?
toward business?

To what degree do students with access to science and high 
technology curricula choose careers in science more than those 
with no such experiences?

Are we preparing youth to meet the human resource needs in
the health sciences? the humanities? teaching? 

Are vocational/occupational programs equipping students with
the skills they need to function in the work place? 

The answers to these questions are of value to business planners,
to parents, and to students themselves as well as to educators
and government agencies.



School Effectiveness 



School administrators are faced with rising costs and multiple
demands on limited resources. They must choose among a host of
competing interests. Achievement data, to be most useful, should
be tied to other information to guide policymakers in deciding
how they might best organize their programs and disperse their
funds. Although achievement is influenced by many factors
(some school related, others beyond the school's control), test
data are one measure of the effectiveness of schools. Holding
other variables constant, what factors within the purview of
school administrators appear most likely to contribute to
increased achievement? How can NAEP assist state and local
policymakers to improve schooling?

If NAEP is conceived not merely as a social indicator, but as a
tool to identify problems and suggest areas of potentially
productive research concerning educational progress, NAEP should
attempt to provide data that address the following kinds of policy
issues:

Do students in programs requiring minimum competencies 
and/or graduation test requirements seem to achieve better 
than other students? 

How do pupil/teacher ratios appear to relate to achievement?

Do students with preschool and/or kindergarten experiences
seem to perform better than those without such programs?

How do particular curricular approaches relate to student
achievement in reading? writing? math?

What are the relationships of the length of the school year 
and/or the availability of summer programs to school achieve- 
ment? 

What are the relationships of in-service training programs, 
teacher turnover rates, and teacher competency requirements
to student performance? 

What types of programs or allocations of resources seem to
make a difference in improving school effectiveness? 

Although for a number of reasons to be discussed later NAEP is
not an appropriate research vehicle to address all of these
questions systematically or in depth, timely analyses of the
achievement data in relation to relevant background and program
variables should suggest provisional interpretations and promising
leads that merit further research attention or special NAEP probe
studies.



Implications for Redesign 



Henry Acland (1980) succinctly defined the major functions of
NAEP: to provide an information base for federal policymakers, to
establish a data base for research, to keep track of performance
levels, and to help state and local education agencies. NAEP, as
originally designed, cannot meet all the demands presently thrust
upon it. In order for the assessment to be most useful, it will be
necessary to alter some of its practices. The following sections
propose ways in which NAEP should be redesigned to address
policy issues of the type we have identified here as important to
current educational practice. To do this we must attack issues of
statistical inference, sampling efficiency, age and grade sampling,
timely data collection, covariance estimation, construct validity,
dimensional analysis and scaling, trend analysis, correlations
with background and program variables, and "causal" analysis.



A New Assessment Design
Responsive to
Changing Assessment Needs

The proposed redesign of NAEP builds solidly on the original design,
but with important modifications, extensions, and innovative
additions:

The new design retains the cyclical scheduling of subject-area
data collection, but (1) changes to a planned schedule of biennial
assessment, (2) introduces the assessment of reading into every
biennial wave so as to increase the timeliness of information in
this basic area as well as to calibrate different cohorts at each age
level, and (3) establishes coverage of four subject-matter fields as
a minimum target for each assessment wave. The off years are
available for focussed studies of special problems or special
populations, such as assessing the educational competencies, and in
succeeding years the educational progress, of functionally
handicapped or limited-English-speaking students. Special assessment
probes in areas as yet not covered, such as computer literacy or
foreign languages or global awareness, could be conducted either
in off years or in connection with a regular assessment wave. In
time, NAEP might capitalize on the field presence entailed by
special studies during off years to move the assessment of reading
and perhaps mathematics to an annual schedule.

The new design retains the current deeply stratified three-stage
sampling plan, but introduces important additions at the third
stage of randomly sampling students within schools so as (1) to
effect sizable sampling efficiencies (through the application of a
powerful variant of matrix sampling called balanced incomplete
block, or BIB, spiralling), (2) to document more fully the
characteristics of students presently excluded from the sample as not
validly testable by current NAEP procedures, and (3) to undertake
sampling by grade level as well as by age. For the second
assessment wave in 1985-86, when it would be possible to influence the
other stages of the sampling plan, steps would be taken to attain
better representation of Hispanic students in terms of their major
cultural subgroups (Puerto Rican, Cuban, Mexican American)
and to undertake systematic reporting of the educational progress
of Hispanics separately.

The new design retains matrix sampling procedures, but as
modified in the form of BIB spiralling so as (1) to reduce school
clustering effects and thereby sampling errors as well as to
produce increased information with a given sample size, (2) to
permit IRT scaling of exercises across booklets for objectives and
performance dimensions spanning the subject-matter area as well as
for those spanning different age levels, and (3) to estimate
covariances among exercises. The ability to estimate covariances
among exercises within a subject area means that the cohesiveness
of judgmental exercise categories can be empirically evaluated,
performance categories or dimensions can be empirically
determined by methods of factor analysis and cluster analysis,
and unidimensionality assumptions of IRT scaling can be
empirically appraised. Once exercises are successfully scaled by IRT
procedures, pupil proficiency estimates can be related to
background, attitudinal, and program variables for the same pupils so
that external correlates, and thus the construct validity, of
exercise dimensions can be appraised. In the second assessment wave,
spiralling of exercises across subject-matter areas will permit
knowledge and skill dimensions in one area to be empirically
related to those in another. Such spiralling will also allow
assessment of the degree to which higher-order skills such as inferential
reasoning or decision making cut across subject areas.
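
The pairing property that makes covariance estimation possible can be
illustrated with a small sketch. The seven-block, three-blocks-per-booklet
layout below is a textbook balanced incomplete block design chosen purely
for illustration; it is an assumption, not the booklet layout proposed for
NAEP, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cyclic_bib(num_blocks=7, base=(0, 1, 3)):
    """Build booklets by shifting a base set of exercise-block indices
    modulo num_blocks (illustrative design, not the NAEP layout)."""
    return [tuple(sorted((b + shift) % num_blocks for b in base))
            for shift in range(num_blocks)]


def pair_counts(booklets):
    """Count how many booklets contain each pair of exercise blocks."""
    counts = Counter()
    for booklet in booklets:
        counts.update(combinations(booklet, 2))
    return counts


def spiral(booklets, num_students):
    """Hand booklets out in rotation, so students seated together in
    one session receive different booklets."""
    return [booklets[i % len(booklets)] for i in range(num_students)]


booklets = cyclic_bib()
counts = pair_counts(booklets)
assignment = spiral(booklets, num_students=20)

for booklet in booklets:
    print("booklet blocks:", booklet)
print("distinct block pairs covered:", len(counts))
print("times each pair appears:", set(counts.values()))
```

Because every pair of blocks appears together in some booklet, some
students answer both blocks of any given pair, which is what allows
covariances among exercises in different blocks to be estimated.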

The new design retains the capacity for comprehensive coverage
of subject matter attained through matrix sampling, but
capitalizes on the structural nature of response consistencies in
exercise performance, as appraised or revealed by covariance
analysis and IRT scaling, to achieve not only more meaningful or
interpretable measurement but more efficient measurement.
Thus, basic performance objectives in a field may be effectively
measured by structured sets of exercises smaller than those
currently used. This would leave more opportunity for the
development and use of innovative exercises and for the assessment of
higher-order subject-matter skills such as organization,
integration, and strategic planning, as for example in science. These
measurement efficiencies will also serve to reduce the number
of exercises needed for effective coverage of subject matter in
any one assessment wave, thereby yielding important cost
efficiencies.

The new design retains the capability for analysis and reporting
at the level of single exercises as well as aggregations of exercises,
but adds, by means of covariance analysis and IRT scaling, the
critical capacity both (1) to construct and evaluate aggregations of
exercises in psychometrically responsible fashion and (2) to
report the performance of different population subgroups on
scales having a common meaning across subgroups, age levels,
and time periods. The use of common scales linked across age
levels and across time periods enormously simplifies analyses of
changes and trends over time while simultaneously yielding
more powerful results and straightforward interpretations.
Moreover, since both exercises and population subgroups are placed on
the same scale, results may be interpreted and reported in either
criterion-referenced terms, norm-referenced terms, or both
conjointly.
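
As a rough illustration of what it means to place exercises and
subgroups on one scale, the sketch below uses the one-parameter
logistic (Rasch) model. The passage does not say which IRT model
would be used, so the model choice, the difficulty values, and the
subgroup proficiency values are all assumptions made for the example.

```python
import math


def p_correct(theta, difficulty):
    """Rasch (one-parameter logistic) probability that a pupil with
    proficiency theta answers an exercise of the given difficulty
    correctly. Both arguments are on the same scale."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Hypothetical exercise difficulties and subgroup mean proficiencies.
exercise_difficulty = {"easy": -1.0, "medium": 0.0, "hard": 1.5}
subgroup_mean_theta = {"group_a": -0.5, "group_b": 0.8}

for group, theta in subgroup_mean_theta.items():
    for name, b in exercise_difficulty.items():
        print(f"{group} on {name} exercise: {p_correct(theta, b):.2f}")
```

Read one way, the output is criterion-referenced (how likely a group
is to succeed on a given exercise); read the other way, it is
norm-referenced (how the groups compare on the common scale).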

Finally, the new design adds the important capacity to correlate
knowledge and skill dimensions with each other as well as with
attitudes, interests, background characteristics, and both school
and program descriptors, thereby making possible a variety of
structural and "causal" or path analyses.

This capsule summary of the critical features of the proposed
redesign of NAEP will now be systematically expanded so that
measurement, analysis, and cost-effectiveness issues may be
addressed in detail.



Data Collection Design Features 

The fundamental weaknesses of NAEP are not in the technical
quality of its output, which is generally high, but in the
limitations of its design and its adherence to procedures of questionable
[illegible] These weaknesses [illegible]
and immediately as possible with due concern for links to past
data but not so much concern for past history that the need for

Data Collection Schedule

One of the major reasons that NAEP has not become a truly useful
indicator of educational progress is that assorted assessment
cycles of three to nine years, which have been characteristic of
NAEP in the past, are too infrequent and sporadic either to keep
pace with educational change or to keep the public's attention.
Worse still, the schedule of subject-matter assessment does not
systematically track the student cohorts as they move through
the age levels used in sampling and reporting, so that cohort
differences are confounded with educational change.

With respect to cohort differences, if a given subject area were
assessed in four-year cycles (that is, with three years intervening
between assessments of that area), then the current sample of
17-year olds assessed in mathematics, for example, would be from
the same student cohort as the sample of 13-year olds assessed in
math four years earlier and as the sample of 9-year olds assessed
in math eight years earlier. Similarly, the current sample of
13-year olds would be from the same student cohort as the sample of
9-year olds assessed four years earlier. By thus matching the
assessment intervals to the number of years intervening between
the age levels sampled, cohort differences in a given subject area
are essentially controlled and interpretations of trend analyses are
simplified.
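
The cohort arithmetic above can be checked with a short sketch. It
assumes, for illustration only, that a subject is assessed at a fixed
interval of whole years and that age differences can be treated as
whole calendar years; the function name is hypothetical.

```python
AGES = (9, 13, 17)  # the three in-school age levels sampled


def matched_cohorts(current_year, cycle_years):
    """For each age level assessed in current_year, list the earlier
    (age, year) samples in the same subject that come from the same
    student cohort, given a fixed assessment cycle in years."""
    matches = {}
    for age in AGES:
        matches[age] = [
            (younger, current_year - (age - younger))
            for younger in AGES
            # An earlier sample exists only if the age gap is a whole
            # number of assessment cycles.
            if younger < age and (age - younger) % cycle_years == 0
        ]
    return matches


print(matched_cohorts(1988, cycle_years=4))
print(matched_cohorts(1988, cycle_years=5))
```

With a four-year cycle every older sample has a matching younger
sample from the same cohort; with a five-year cycle none does, which
is the confounding of cohort differences with educational change.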

To rectify these problems of timeliness and cohort matching in
a cost-conscious way, the proposed redesign entails a planned
schedule of NAEP data collection every other year, with reading
being assessed biennially. The other two basic areas of
mathematics and writing are assessed in alternate waves in four-year
cycles, as are science and possibly literature. Because of legal
requirements and prior commitments, it is proposed that reading,
writing, and citizenship/social studies be assessed in the first
year of the redesign (the 15th year of NAEP, 1983-84), but that
thereafter four subject areas be covered in each wave so as to
shorten the assessment cycle for the remaining learning areas.
This proposed assessment schedule is summarized in Table 1 for
the first five years of the redesign.

The biennial assessment of reading heightens the pace with
which at least one important barometer of national educational
progress can be brought before the public and the educational
community. In a simple variant of this design, the two basic areas
of reading and mathematics would be assessed biennially, which
might be possible without sacrificing timely coverage of other
areas once the measurement efficiencies discussed below are
realized.

Table 1

Assessment Schedule for Subject Areas

Assessment Year    Subject Areas

15th  1983-84      Reading; Writing; Citizenship/Social Studies
16th  1984-85      Special Studies
17th  1985-86      Reading; Math; Science;
                   Area A (e.g., Career and Occupational Development)
18th  1986-87      Special Studies
19th  1987-88      Reading; Writing; Area B (e.g., Literature);
                   Area C (e.g., Music/Art)

The biennial assessment of at least one subject area such as
reading also provides an important technical benefit. Although
the 13-year old and 17-year old samples collected in assessment
year 19 are from the same student cohort as the 9-year old and
13-year old samples, respectively, collected four years earlier in year
15 (which is also true of year 21 samples versus year 17 samples),
year 17 represents a different cohort from year 15. Thus,
successive waves represent different student cohorts while alternate
waves, being spaced at intervals matching the differences in age
levels, represent the same student cohort. However, with the
assessment of reading common to successive waves, cohort
differences can be appraised and calibrated, as it were, and trend
interpretations modified accordingly.

The assessment schedule given in Table 1 applies to the three
major samples used in NAEP: 9-year olds, 13-year olds, and
in-school 17-year olds. Although it is important to return to the
practice of sampling out-of-school 17-year olds and adults, it is
recommended that more cost-effective means be employed for
accomplishing this, such as the use of the Current Population
Survey of the Bureau of the Census, as discussed in the subsequent
section on sampling.

The proposed plan attempts to offset the increased cost of
covering four subject areas per wave by deliberately scheduling off
years with no data collection every other year. These off years are
to be devoted to intensified exercise development, data analysis,
report writing, and dissemination. They are also available for
special studies financed through additional resources from a
variety of sources. A number of such special studies are briefly
described in a later section. Special assessment probes in new
subject areas could also be conducted during these off years, again
with additional financial resources. But with the capability for
correlating across subject areas discussed below, there are
advantages to coordinating special probes with the assessment of
potentially related or mutually facilitative fields. For example, from
the standpoint of illuminating connections and transfer across
fields, it would be advantageous to schedule a special probe for
computer literacy in year 17 (or 21) when mathematics and
science are assessed.

Additional cost-effectiveness further buttressing the feasibility
of the proposed schedule derives from the measurement and
sampling efficiencies discussed in the later section on spiralling.
Since sample size is the major determinant of data collection
costs and since the number of exercises answerable in a fixed
amount of time drives the number of booklets which in turn
drives sample size, improvements in measurement efficiency
permitting effective subject coverage with fewer exercises have
important cost consequences, as would the negotiation of increased
time per student for exercise administration.
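
The chain from exercises to booklets to sample size can be made
concrete with a few lines of arithmetic. The sketch assumes the
simplest case, in which each exercise appears in exactly one booklet
and each booklet needs a fixed number of respondents; every figure
below is invented for illustration and is not a NAEP parameter.

```python
import math


def booklets_needed(total_exercises, exercises_per_booklet):
    """Number of booklets required if each exercise appears once."""
    return math.ceil(total_exercises / exercises_per_booklet)


def students_needed(total_exercises, exercises_per_booklet,
                    students_per_booklet):
    """Sample size implied by the number of booklets."""
    return (booklets_needed(total_exercises, exercises_per_booklet)
            * students_per_booklet)


# Hypothetical baseline: 300 exercises, 30 fit in one session.
print(students_needed(300, 30, 2500))
# Fewer exercises for the same coverage (measurement efficiency).
print(students_needed(240, 30, 2500))
# More time per student, so more exercises fit in a booklet.
print(students_needed(300, 40, 2500))
```

Under these assumptions, either reducing the exercise pool or
lengthening the session lowers the number of booklets and hence the
number of students who must be tested.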

Sampling 

The proposed redesign retains the current deeply stratified
three-stage sampling plan as modified to meet some new purposes in
addition to the old. The first stage of sampling entails classifying
the primary sampling units or PSUs into strata defined by
geographic region and community type. The PSUs are typically
counties, but small counties are aggregated so that no PSU has fewer
than an estimated 1500 youths at each assessment age. For each
age level, the second stage entails enumerating, stratifying, and
selecting schools, both public and private, within each PSU
selected at the first stage. The third stage involves randomly
selecting students within a school for participation in NAEP. For a
typical assessment session, from 16 to 25 students of the same
age (either 9-, 13-, or 17-year olds) are assembled to respond to
the exercises in a particular booklet.
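
The three stages can be sketched as nested random draws. This is a
deliberately simplified illustration: it uses simple random sampling
at every stage and omits the stratification of schools and any
selection probabilities, which the passage does not specify. The
frame, the names, and the counts are hypothetical.

```python
import random


def three_stage_sample(psus_by_stratum, psus_per_stratum,
                       schools_per_psu, students_per_session, seed=0):
    """Draw PSUs within strata, schools within PSUs, and students
    within schools. psus_by_stratum maps
    stratum -> PSU -> school -> list of student ids."""
    rng = random.Random(seed)
    sample = []
    for stratum, psus in psus_by_stratum.items():
        # Stage 1: PSUs within each region/community-type stratum.
        for psu in rng.sample(sorted(psus),
                              min(psus_per_stratum, len(psus))):
            schools = psus[psu]
            # Stage 2: schools within each selected PSU.
            for school in rng.sample(sorted(schools),
                                     min(schools_per_psu, len(schools))):
                students = schools[school]
                # Stage 3: students within each selected school.
                chosen = rng.sample(
                    students, min(students_per_session, len(students)))
                sample.extend((stratum, psu, school, s) for s in chosen)
    return sample


# Toy sampling frame (hypothetical).
frame = {
    ("Northeast", "big city"): {
        "psu_1": {"school_a": list(range(40)),
                  "school_b": list(range(40, 70))},
        "psu_2": {"school_c": list(range(70, 100))},
    },
    ("West", "rural"): {
        "psu_3": {"school_d": list(range(100, 120))},
    },
}
sample = three_stage_sample(frame, psus_per_stratum=1,
                            schools_per_psu=1, students_per_session=16)
print(len(sample), "students selected")
```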

Originally, samples of 17-year old dropouts and early graduates,
as well as of adults 26 to 35 years of age, were located in their
homes where one or more assessment booklets were
administered. Recently, however, limited budgets have led to less
frequent assessment of the adult group as well as the out-of-school
17-year olds, the latter loss being much more serious because of
the biases entailed in estimating 17-year old performance from
in-school samples alone.

The three sampling stages, with certain exceptions, are
acceptable for the proposed NAEP redesign. Some minor procedural
modifications are needed at the third stage to accommodate (1)
the BIB spiralling variant of matrix sampling, (2) grade-level as
well as age-level sampling, and (3) the fuller documentation of
the numbers and types of students excluded from the sample as
not validly testable by present NAEP means. A modification is
needed in the sampling of PSUs and of schools in order to improve
the representation of Hispanic cultural subgroups and to permit
the systematic reporting of Hispanic performance separately.
Sampling of students by grade level, documentation of
functionally-handicapped students excludable from the NAEP sample as
untestable, and representation of Hispanic cultural subgroups are
discussed next in turn; BIB spiralling is treated in detail in the
succeeding section.

Sampling by grade as well as by age. The restriction of NAEP to
age-level sampling and reporting makes it difficult if not
impossible to link national assessment results to school practices, state
and local assessments, and educational policies, most of which
are typically tied to grade level. This is one of the main reasons
that NAEP results are less directly useful than they might be for
educational purposes. Accordingly, even though the meaning of
grade level varies in different parts of the country depending on
the age at which children are admitted to school and on the
advancement and retention policies of local school systems, it
seems imperative that grade-level sampling and reporting be
incorporated into NAEP but not at the expense of eliminating age
sampling.

There are also important reasons for sampling by age, not the
least of which is that age has a common meaning across
geographical regions and school practices. Another critical reason for
not relying on grade sampling alone is that many disadvantaged
youth are overage for their grade placement, which would
seriously distort the meaning of average grade-level performance and
seriously compromise the interpretation of grade trends as
indications of educational "progress." Taken together, these arguments
imply that NAEP sampling and reporting should be by both age and
grade.

The addition of grade sampling is not a minor embellishment
to age sampling but, rather, a distinctly different though
coordinate perspective for characterizing educational achievement
and change. According to figures from a recent report of the
Bureau of the Census, only about 70 percent of 9-year old
students are in grade 4, which is their modal grade, and a roughly
similar percentage of students in grade 4 are nine-year olds,
which is the modal age in that grade. Similar percentages hold for
13-year olds and grade 8 while somewhat lower percentages
obtain for 17-year olds and grade 12. Hence, age and grade sampling
and their associated analyses provide critical counterpoint to
each other in disentangling the import of performance levels and
trends. In addition, following the lead of Truman Kelley (1940),
special analyses of the "ridge" of students of modal age who are
in their modal grade might provide useful norms for many
comparative purposes, although they might also be simplistic for
other interpretive purposes.
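
The modal-grade and "ridge" ideas can be shown on a toy
age-by-grade tabulation. The counts below are invented so that about
70 percent of 9-year olds fall in grade 4; they are not Census
figures, and the helper names are hypothetical.

```python
from collections import Counter


def modal_lookup(records, key_index, value_index):
    """Map each key (e.g. age) to its most common value (e.g. grade)."""
    counts = {}
    for rec in records:
        counts.setdefault(rec[key_index], Counter())[rec[value_index]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def ridge(records):
    """Students of modal age for their grade who are also in the
    modal grade for their age."""
    modal_grade = modal_lookup(records, 0, 1)  # age -> modal grade
    modal_age = modal_lookup(records, 1, 0)    # grade -> modal age
    return [(age, grade) for age, grade in records
            if modal_grade[age] == grade and modal_age[grade] == age]


# Invented (age, grade) records.
records = ([(9, 4)] * 70 + [(9, 3)] * 25 + [(9, 5)] * 5
           + [(10, 4)] * 25 + [(8, 4)] * 5)

nine_year_olds = [r for r in records if r[0] == 9]
fourth_graders = [r for r in records if r[1] == 4]
on_ridge = ridge(records)

print("share of 9-year olds in grade 4:",
      len(on_ridge) / len(nine_year_olds))
print("share of grade 4 who are 9:",
      len(on_ridge) / len(fourth_graders))
```

In this toy table the age sample and the grade sample overlap only
on the ridge, which is why results by age and by grade can differ.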

Documenting sample exclusions. Although NAEP is meant to be
a barometer or report card on the national condition of education,
past implementations have excluded significant populations of
students from data collection, in particular limited-English
speaking and functionally-disabled pupils. The exclusion of these
populations has significant implications for NAEP both because of
their size and the resources invested in their members'
education. While the exclusion of these populations limits the
generality of the NAEP report card, such exclusion is understandable
because many practical and theoretical issues exist in the
assessment of both handicapped and non-English proficient students.
In the past, NAEP has dealt with these issues by directing local
school personnel [illegible] three gross
categories: limited-English speaking, functionally disabled, and
educable mentally retarded (Research Triangle Institute, 1979).
Criteria for determining membership in these categories have been
left primarily to the judgment of the local school districts. Data
collected on these excluded cases appear to have been limited
solely to the number of pupils falling within each broad category.
These categorizations obviously provide precious little
information on exactly who is being omitted from the NAEP program.
Knowing who is being excluded from NAEP is critical for at least
two reasons. First, without such information, it is difficult to
know precisely whom the NAEP report card does and does not
represent. For example, NAEP data on Hispanics may not be
representative of Hispanic youth as a whole due to the exclusion of
non-English proficient students from data collection. Second, if
the NAEP barometer is truly to represent the national condition of
education, we must eventually find meaningful and practical
ways to assess currently excluded populations. Detailed
description of these populations is a necessary first step in developing
workable assessment strategies for them. Since much of the
needed information is contained in student records that can be
consulted by school officials and trained data collectors as part of
the process of identifying students to be excluded from the
assessment, its systematic collection would be facilitated by the
development of a form for characterizing excluded students along a
number of important background dimensions.

The proposed form would include such pupil and program
descriptors as age, sex, ethnicity, languages of the home and
frequency of use, current program (duration, setting, percent time
mainstreamed, related services, pupil/teacher ratio, primary goal
areas, languages used in instruction, percent of instruction in
English), years of previous special or language instruction, type and
severity of handicapping condition, and specific reason for
exclusion from NAEP.
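
One way to picture the proposed form is as a structured record. The
sketch below simply mirrors the descriptors listed above; the class
names, field names, types, and example values are assumptions made
for illustration, not a layout specified in the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CurrentProgram:
    """Descriptors of the student's current program (all optional)."""
    duration_months: Optional[int] = None
    setting: Optional[str] = None
    percent_time_mainstreamed: Optional[float] = None
    related_services: List[str] = field(default_factory=list)
    pupil_teacher_ratio: Optional[float] = None
    primary_goal_areas: List[str] = field(default_factory=list)
    languages_of_instruction: List[str] = field(default_factory=list)
    percent_instruction_in_english: Optional[float] = None


@dataclass
class ExcludedStudentRecord:
    """One excluded student, described along the listed dimensions."""
    age: int
    sex: str
    ethnicity: str
    exclusion_reason: str
    # Home language -> frequency of use, e.g. {"Spanish": "always"}.
    home_languages: Dict[str, str] = field(default_factory=dict)
    current_program: CurrentProgram = field(default_factory=CurrentProgram)
    years_previous_special_or_language_instruction: Optional[float] = None
    handicapping_condition: Optional[str] = None
    handicap_severity: Optional[str] = None


# Hypothetical example record.
record = ExcludedStudentRecord(
    age=9,
    sex="F",
    ethnicity="Hispanic",
    exclusion_reason="limited-English speaking",
    home_languages={"Spanish": "always", "English": "rarely"},
    current_program=CurrentProgram(
        setting="bilingual classroom",
        languages_of_instruction=["Spanish", "English"],
        percent_instruction_in_english=40.0,
    ),
)
print(record.exclusion_reason, record.current_program.setting)
```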

Within the proposed NAEP redesign, three major uses of these
types of data are envisioned. First, such data will provide a
meaningful characterization of students excluded from NAEP samples
and hence from generalizations about results. Second, this
characterization will be compared with other characterizations of
handicapped and limited-English speaking students formulated
from existing data bases (e.g., those generated through periodic
surveys conducted by the Office for Civil Rights, the National
Center for Education Statistics, and the Annual Child Count of
PL 94-142). This comparison should suggest the extent to which
special segments of the handicapped or non-English proficient
populations, such as the learning disabled, are being served by
NAEP. Finally, the data collected will be employed as the basis for
a proposed strategy for assessing traditionally excluded groups in
future years, a strategy discussed at greater length in the later
section on Special Studies.

Sampling Hispanic students. Given the increasing size of the
Hispanic population in the United States and the distinctive
educational problems of Hispanics related to bilingual and bicultural
background, Hispanic results should not be averaged together
with those of other groups but rather should be analyzed and
reported separately, as has been done to some degree in recent
NAEP reports. In doing this, however, it would be important to
attain representative coverage of the major Hispanic cultural
subgroups (Puerto Rican, Cuban, and Mexican American) because
of differences in their social and migrational histories that have
implications for their educational progress. Since these groups are
differently distributed throughout the country, this implies some
modification of the sampling plan. This change in the sampling
procedures would not be initiated before the second assessment
wave in the NAEP redesign (1985-86), when it would first be
possible to influence the various stages of the sampling design.

In addition, there remain two other sampling issues that warrant
further discussion, each entailing possibly cost-effective
compromises with current or former procedures. One involves a
strategy for administering NAEP exercises to adult samples and
possibly to out-of-school 17-year olds. The other involves the
enlistment of cooperating schools for repeated participation in NAEP.

Sampling adults. Since competent adult functioning in society
is an ultimate goal of educational progress, it is important for
NAEP to return to the practice of sampling adults. Furthermore,
since estimates of 17-year old performance based only on
in-school samples are inevitably biased, it is important to include
out-of-school samples as well. It is proposed that cost-effective
means for accomplishing this be seriously investigated, such as
the use of the Current Population Survey of the Bureau of the
Census.

Every month the Census Bureau surveys 70,000 households to
ask a variety of questions, using a continuously rotating sample.
All contacts are made during the same week of each month by
trained part-time permanent employees who [illegible] phone each
of the 70,000 households. Each household is used eight times
during a 16-month period: households are in the sample during
four consecutive months in one year, out eight consecutive months,
and back in the sample the same four calendar months the next
year. Each month there is a 75 percent overlap with the previous
month's sample. The sample is highly stratified using Census
data: PSUs are generated at the county level and about 250 areas
are in the sample with certainty, some 160 of which are
metropolitan areas. In addition, approximately 3[?]5 other PSUs
are sampled, about 40 of which are metropolitan areas. The
samples are updated monthly using construction and
building-permit records or, where those are not available, actual
physical inventories of housing units are listed. Each October the
school enrollment study is conducted, and NCES is considering
becoming a regular co-sponsor of that effort. Non-Census
government agencies may participate in the Current Population
Survey with supplementary inquiries, but are limited to fifteen
minutes per interview in a particular month.
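
The rotation pattern described above can be checked with a short
sketch. It assumes, as a simplification not stated in the text, that
one equally sized rotation group enters the sample each month; the
function and variable names are illustrative only.

```python
# Sketch of the 4-in, 8-out, 4-in household rotation described above.
# Assumption (not from the report): one equally sized rotation group
# enters the sample every month. All names are hypothetical.

IN_SAMPLE_OFFSETS = list(range(0, 4)) + list(range(12, 16))


def groups_in_sample(month):
    """Entry months of the rotation groups interviewed in `month`."""
    return {month - offset for offset in IN_SAMPLE_OFFSETS}


def overlap(month_a, month_b):
    """Share of month_a's rotation groups that are also in month_b."""
    a = groups_in_sample(month_a)
    b = groups_in_sample(month_b)
    return len(a & b) / len(a)


month_to_month = overlap(20, 19)   # consecutive months
year_to_year = overlap(32, 20)     # same calendar month, one year apart
span_in_months = max(IN_SAMPLE_OFFSETS) + 1
print(month_to_month, year_to_year, span_in_months)
```

Under these assumptions the month-to-month overlap comes out at 75
percent, matching the figure quoted above, and each household's eight
interviews span 16 months.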

Preliminary inquiries indicate that NAEP, as a
government-sponsored program, is eligible to participate and that
administration of subject-matter exercises is considered to be
feasible, although they might be restricted to the concluding
segment of interview sessions. Since in-home administration of NAEP
exercises would require special training of the interviewers, the
expected lead time might exceed the current estimate of six to
seven months. Moreover, since the collection of labor-force
participation information for the Department of Labor is a major part
of this service, it might be possible to relate educational
achievement measures on samples of adults and out-of-school 17-year
olds obtained by this means directly to indicators of labor-force
participation and employment trends.

Repeated school participation. When independent samples of 
schools are drawn in successive assessment waves, school-to- 
school differences in average performance level contribute part of 
the sampling error in the measurement of changes over time. 
Therefore, sizable reductions in sampling error could be attained 
if the same schools participated in successive assessment waves. 
From the standpoint of both sampling efficiency and school
contact costs, it would seem ideal to recruit schools to participate in
four successive assessment waves, with a fourth rotated out and
replaced by a new sample of schools each wave. Realistically,
however, this strategy might produce an unacceptably deleterious
effect on the cooperation rate. Furthermore, participation of a
school in one assessment wave might affect its performance in
succeeding waves. Therefore, although this strategy should be
seriously investigated, it does not seem highly promising and
should be carefully evaluated before proposing its
implementation.

A compromise between independent sampling of schools in 
successive waves and repeated participation of the same schools 
may prove more feasible and still substantially improve
efficiency. This compromise entails rotation of PSUs and schools in the
sample so that 50 percent of the PSUs and schools are identical in
two successive assessment waves for the same subject area. The
advantage of this compromise over the current NAEP approach of
independent school sampling is that it should substantially
reduce the sampling errors of measures of change over time. That
is, schools make important contributions to variance for any
given assessment wave and, with independent samples in
successive waves, school contributions to the variance of the differences
are essentially doubled. With an identical sample of schools in
the two waves, these contributions to the variance of change are
reduced by a factor of (1-r), where r is the average within-PSU
correlation between the years for the particular exercise or aggregate
being estimated for the identical schools. Since r is often as large
as .7 or more, worthwhile efficiencies are achieved in estimates
of change. A rotating sample with 50 percent of the schools
identical from one wave to the next would achieve about half of this
benefit.
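
The variance argument above can be written out as a small
calculation. The sketch below assumes a simple difference between two
wave means, an equal school variance component in both waves, and a
benefit proportional to the share of schools retained; these
simplifications and all names are illustrative, not taken from the
report.

```python
# Sketch: school contribution to the variance of a change estimate.
# Assumptions (illustrative): simple difference of two wave means, equal
# school variance component in each wave, and a benefit proportional to
# the share of schools retained.


def school_variance_of_change(school_var, r, overlap):
    """School component of Var(wave2 - wave1).

    school_var: school variance component in a single wave
    r:          between-wave correlation for retained schools
    overlap:    share of schools common to both waves (0 to 1)
    """
    return 2.0 * school_var * (1.0 - overlap * r)


independent = school_variance_of_change(1.0, 0.7, 0.0)   # doubled
identical = school_variance_of_change(1.0, 0.7, 1.0)     # times (1 - r)
half_rotated = school_variance_of_change(1.0, 0.7, 0.5)
print(independent, identical, half_rotated)
```

With r = .7 the half-rotated value falls midway between the
independent and identical cases, which is the sense in which a 50
percent rotation recovers about half of the benefit.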

Unless school cooperation can be retained at substantially the
same level under this procedure and unless participation in one
wave affects performance in the next only moderately at most,
this compromise strategy should not be adopted. However, since
a rotation group effect observed in several studies tends to
approach a modestly biased but stable level over time, this
compromise strategy should be carefully appraised. It might prove
feasible in connection with state assessments, in which cooperating
states could arrange for school participation in a two-wave
rotational plan.

Balanced Incomplete Block (BIB) Spiralling

The theoretical basis for the current method of assigning
exercises (matrix sampling) was developed at Educational
Testing Service by Frederic Lord (1955, 1962). Matrix sampling,
as implemented by NAEP, entails dividing the exercise pool
for a given age level into different assessment packages or
booklets such that each package contains about as many exercises as a
student can answer in the given time period. The packages are
discrete in the sense that an exercise that appears in one package
does not usually appear in another, although exercises often
appear in other packages at a different age level. This method of
matrix sampling is adequate for estimating the proportion of
persons in a population who can respond correctly to an exercise. It
is not adequate for determining the structure of performance
consistencies in a subject area or for estimating levels and trends in
composite variables created from exercises in different
assessment packages.

Another technique for distributing exercises to respondents is
conventional spiralling, which has long been used by Educational
Testing Service in its major testing programs. As an example,
each Scholastic Aptitude Test (SAT) contains one section that does
not contribute to an individual's SAT score but is used instead for
introducing new and innovative items and for linking the present
test form with past and future forms of the SAT. Although each
individual takes only one such variable section, it is possible to
administer a number of different sections in a single SAT
administration. This is done by spiralling the variable section—that is, test
booklets are assembled so that, say, the first booklet has variable
section 1, the second booklet has variable section 2, and so forth,
until all variable sections have been distributed and then the
process is repeated. Since examination booklets are assigned to
individuals in the order in which they are seated in the examination
room, administration is easy as long as the variable sections all
require the same amount of time. Pre-coded answer sheets are
inserted in test booklets so that the different sections are
distinguishable by scoring machines.
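
Conventional spiralling as described above amounts to dealing the
variable sections out in a repeating cycle. A minimal sketch follows;
the number of sections and the room size are made-up values, and the
function name is hypothetical.

```python
# Sketch of conventional spiralling: booklets are stacked so that the
# variable section cycles 1, 2, ..., k, 1, 2, ... and are handed out in
# seating order. The section count and room size are made-up values.
from collections import Counter


def spiral(n_examinees, n_sections):
    """Variable-section number received by each seat, in seating order."""
    return [(seat % n_sections) + 1 for seat in range(n_examinees)]


assignment = spiral(n_examinees=23, n_sections=5)
counts = Counter(assignment)
print(assignment[:7])   # [1, 2, 3, 4, 5, 1, 2]
print(counts)
```

Because the cycle simply repeats, the number of examinees taking any
two sections in a room never differs by more than one.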

The proposed NAEP redesign entails a modified data collection
procedure that combines the advantages of matrix sampling with
those of conventional spiralling. This procedure, which is called
balanced incomplete block (BIB) spiralling, is an extension of
ideas expounded by Knapp (1968). Essentially, it involves
developing a balanced incomplete block design such that each exercise
is administered the same number of times as it would be in
matrix sampling, but in addition each pair of exercises is also
assessed a prescribed number of times. This means that each
exercise will be located in several different packages or booklets, so
that many different packages must be printed for an exercise pool
of a given size. The BIB spiralling of exercises also implies that
many different packages, and thus different sets of exercises, will
be administered in a particular assessment session.

BIB spiralling and matrix sampling. An example contrasting
ordinary matrix sampling with the BIB spiralling variant of matrix
sampling will illustrate their differences. Consider a reading
assessment for age 13 and assume that the assessment pool
contains 165 exercises. Assume further that a 13-year old can do 33
reading exercises during the allotted assessment time and that
the sampling plan calls for 2,100 13-year olds to take each
exercise. Although these assumptions are arbitrary, they are
reasonably close to what would be expected during a typical assessment
wave. These particular numbers were chosen to simplify the
arithmetic below.

The matrix sampling approach as employed by NAEP would
divide the exercise pool into five different packages of 33
exercises each. Each different package would be bundled into sets
containing as many copies of that package as there are students
expected in an assessment session. A selected school would be
assigned one or more assessment sessions and would receive one or
more different sets of packages accordingly. Following past
practice, no school would receive all packages. For each assessment
session, a different random sample of students within the school
would be selected and scheduled. All students in a given session
would receive the same set of exercises because of the current
NAEP practice of taped aural presentation and pacing. A sampling
and management plan is needed to assure that each set of
packages is administered an appropriate number of times within
each PSU. The total assessment would include five packages each
administered to 2,100 youths or 10,500 students in all.

Next, consider the BIB spiralling approach. First, it is clear that
a distinct package cannot be developed for each possible
combination of exercises since the number of combinations of 165
exercises taken 33 at a time is astronomical. In the balanced
incomplete block approach, however, the exercises can be combined
into 15 discrete blocks of 11 exercises each, and these blocks of 11
exercises can be permuted such that each pair of blocks occurs
together in at least one package. Under this plan, many more
different packages would be printed, although the number of
students taking each exercise as well as the total number of students
assessed would be the same as in the present NAEP matrix
sampling plan.

A balanced incomplete block design that fits these
specifications is shown in Table 2. The blocks of 11 exercises are
numbered from one to fifteen. Each row of the table shows the
numerical designation of the blocks that would be contained in a
particular package; the left-hand set of columns shows the blocks
in simple order and the right-hand set shows the same design
with the package numbers randomly recoded and the rows and
columns randomly permuted. This design would require that 35
different assessment packages be printed, each package
containing three blocks of 11 exercises.

[Table 2. Balanced Incomplete Block Design for 165 Exercises in 35
Booklets Comprising 33 Exercises Each. Columns: Booklet No., Simple
Block Order, Random Block Order.]

Examination of the table indicates that each block of exercises
occurs in exactly seven packages and that each pair of blocks
occurs in precisely one package. If each package is administered 300
times, then each block of exercises will be presented to 2,100
different students. An exercise in one block will be administered to
the same students as an exercise in another block 300 times. The
total assessment would include 35 packages times 300 students
for each package or 10,500 13-year olds in all, the same number as
in the matrix sampling design.
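
The counts in this paragraph can be verified on any design with the
same parameters (15 blocks, three blocks per package, each pair of
blocks together exactly once). The sketch below builds one such design
from a standard construction, taking every triple of block labels
(a, b, a XOR b) over the labels 1 to 15; it is offered as an
illustration and is not claimed to be the arrangement printed in
Table 2.

```python
# Sketch: a balanced incomplete block design with 15 blocks in 35
# packages of 3 blocks each. The XOR construction is a stand-in chosen
# for brevity; it is not taken from the report.
from collections import Counter
from itertools import combinations

blocks = range(1, 16)                      # 15 blocks of 11 exercises
packages = sorted({tuple(sorted((a, b, a ^ b)))
                   for a, b in combinations(blocks, 2)})

block_counts = Counter(b for pkg in packages for b in pkg)
pair_counts = Counter(pair for pkg in packages
                      for pair in combinations(pkg, 2))

copies_per_package = 300
students_per_block = {b: n * copies_per_package
                      for b, n in block_counts.items()}
total_students = len(packages) * copies_per_package

print(len(packages))                          # number of packages
print(set(block_counts.values()))             # packages per block
print(set(pair_counts.values()), len(pair_counts))
print(set(students_per_block.values()), total_students)
```

Each block lands in seven packages and each of the 105 block pairs in
exactly one, so 300 copies of each package give 2,100 students per
block and 10,500 students overall, as stated above.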

Moreover, BIB spiralling simplifies the administration of
assessment sessions. Under the present NAEP application of matrix
sampling, care must be taken to distribute the correct packages
within PSUs. Consider now that the 35 different packages in the
BIB spiralling example are merged in a random sequence and that
the same sequence is repeated for all sets of 35 packages. If for a
target assessment session of 25 students the packages are
assembled in consecutive sets of 26 or 27 packages, then each session
will have enough packages for the scheduled students and one or
two extra in case of special situations. Under this cycling system,
each package will be first in a set an equal number of times and
the packages not used at the end of a set will be balanced over all
sets. Thus, within a PSU the only consideration is the number of
assessment sessions or the total sample size, and the particular
packages administered is not a management issue.
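
The cycling system can be illustrated with a short sketch. It assumes
sets of 27 packages cut consecutively from a repeating random sequence
of the 35 packages; the seed, the number of sets examined, and the
function name are arbitrary choices made for the example.

```python
# Sketch of the cycling system: a fixed random sequence of the 35
# packages is repeated end to end and cut into consecutive session sets.
# Set size 27 and the number of sets checked are illustrative choices.
import random
from collections import Counter

N_PACKAGES = 35


def bundle_sets(sequence, set_size, n_sets):
    """Cut a repeating package sequence into consecutive session sets."""
    sets, start = [], 0
    for _ in range(n_sets):
        sets.append([sequence[(start + i) % len(sequence)]
                     for i in range(set_size)])
        start += set_size
    return sets


rng = random.Random(0)                      # arbitrary seed for the sketch
sequence = rng.sample(range(1, N_PACKAGES + 1), N_PACKAGES)
sets = bundle_sets(sequence, set_size=27, n_sets=N_PACKAGES)

first_in_set = Counter(s[0] for s in sets)
times_bundled = Counter(p for s in sets for p in s)
print(set(first_in_set.values()), set(times_bundled.values()))
```

Over 35 consecutive sets every package heads a set exactly once and is
bundled the same number of times, which is the balance the paragraph
describes; the same holds for sets of 26, since neither 26 nor 27
shares a factor with 35.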

It should also be stressed that BIB designs, although not
necessarily available for exercise pools of any particular designated
size, may be readily developed for a wide array of sizes. Indeed,
we have not yet found a reasonably sized pool for which an
appropriate design could not be developed in that neighborhood. For
example, although there is not a good design for 100 exercises in
blocks of ten, there is an excellent design for 99 exercises in
blocks of 11. An example for 250 exercises in blocks of ten
appears in the later section on IRT scaling. Designs may be of many
types: Latin squares, Youden rectangles, lattices, and so forth
(Cochran & Cox, 1957). In any event, if no balanced design can be
found for a particular case, a slightly less efficient imbalanced
design could be used instead.

Although ordinary matrix sampling in this example requires
only five different packages while BIB spiralling requires 35, the
total number of printed packages or booklets, as well as the total
number of printed pages, remains the same. For the extra
assembly costs, we have assured that each pair of blocks of exercises is
administered to a certain number of youths. In this way a
complete cross-products matrix of all exercises can be produced, and
this matrix can serve a number of important functions—such as
ascertaining the interrelationships among objectives or
performance dimensions, testing the unidimensionality of the
measurement area or subareas for applications of IRT scaling, and
delineating the structure of achievement in an area by means of
factor analysis and multidimensional scaling. It should be noted,
however, that this cross-products matrix is not quite a standard
one because its elements are based on different samples; the
analytic features of this type of matrix are discussed in a later
section on covariance analysis.

It should also be noted that BIB spiralling is statistically more
efficient than ordinary matrix sampling for some estimates. By
administering more different exercises within a particular school
and by administering a particular exercise in more different
schools, the school clustering effect is reduced and the BIB
sampling design is consequently more efficient. Preliminary
calculations, using reasonable assumptions about the cluster effects now
common in NAEP results, suggest that BIB spiralling can reduce the
number of students necessary to attain a given sampling error by
about 20 to 25 percent when compared to ordinary matrix
sampling, or reduce the standard errors by 10 to 15 percent when
using the same sample size.
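
The two ways of stating the efficiency gain are consistent with each
other if sampling variance is taken to be inversely proportional to
the number of students, an assumption made here only for
illustration. A brief check, with a hypothetical function name:

```python
# Sketch: convert a saving in required sample size into the equivalent
# reduction in standard error at a fixed sample size.
import math


def se_reduction_at_same_n(sample_size_saving):
    """Relative drop in standard error implied by a given saving in n.

    Assumes sampling variance is inversely proportional to the number of
    students, so SE scales with the square root of the variance ratio.
    """
    return 1.0 - math.sqrt(1.0 - sample_size_saving)


low = se_reduction_at_same_n(0.20)
high = se_reduction_at_same_n(0.25)
print(round(low, 3), round(high, 3))
```

A 20 to 25 percent saving in students corresponds to a drop in
standard error of roughly 11 to 13 percent, inside the 10 to 15
percent range quoted above.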

In the proposed redesign, BIB spiralling is applied in the first
assessment wave only in the assessment of reading. This is
because data collection for citizenship/social studies is the
completion of an assessment already begun using the original matrix
sampling procedure, and the repackaging of the writing exercises
as currently constituted is of dubious cost-effectiveness.
However, in the next and succeeding assessment waves, BIB spiralling
is to be applied in all subject areas. In addition, BIB spiralling will
be undertaken across subject areas to delineate interconnections
between knowledge and skill in one area and that in another as
well as to appraise the degree to which higher-order skills cut
across areas.

From public use tapes to public useful tapes. Except for simple
analyses of average percent correct on aggregations of exercises
judged to assess the same objective, current NAEP data based on
ordinary matrix sampling is inherently booklet-bound. For
judgmental aggregations of exercises that cut across booklets, analyses
going much beyond the simple reporting of performance levels
face a major roadblock. Even appraisal of the empirical coherence
and correlates of such aggregates must be undertaken one booklet
at a time, if at all. This is a serious limitation in primary NAEP
data analyses, but it is even more debilitating in secondary
analyses based on the current public use data tapes.

Each data file on these public use tapes contains the results for
one booklet or package of exercises for one age level of one
assessment wave. This means that even for simple analyses of average
percent correct that entail aggregation across several packages in
a subject area, it is necessary to process from 10 to 30 separate
data files. Just to locate all of the exercises written for a particular
objective or containing a particular type of subject-matter content
requires an elaborate search. Worse still, any appraisal of
the reliability or generalizability of the exercises representing a
specific objective, as well as appraisals of their construct validity 
vis-a-vis correlations with other objectives or with background 
variables, must be carried out one booklet at a time on whatever 
collection of exercises happens to appear there (Anderson, Welch,
& Harris, 1982; Hambleton, 1982).

One of the major benefits of BIB spiralling is that both primary
and secondary analyses of NAEP data are freed from the booklet
bind. Correlations can be computed among all exercises in a
subject area. Any aggregation of exercises from whatever
combination of booklets can be appraised for reliability or generalizability
and correlated with other item aggregations as well as with
background variables. Similarly, IRT scaling can be applied to exercises
drawn from any set of booklets in the subject area. Secondary
analysis is also enormously facilitated by public use data files
each of which will now contain all of the exercises in a subject
area easily retrievable by objective measured, by type of content,
by format, and so forth. In short, BIB spiralling makes it possible
to convert public use tapes into public USEFUL tapes.

Trade offs in taped aural administration. The use of BIB
spiralling has one serious implication that must be confronted, which
is that BIB administration is inconsistent with aural presentation
and pacing of exercises using a tape recorder. This is not likely to
seriously affect the reading assessment, which is only paced by
tape and not presented aurally. But each of the other areas may be
substantially affected. The problem, of course, is that with BIB
spiralling the students are assessed on different packages, and
aural presentation would result in cacophony at the assessment
session, unless expensive equipment such as headphones were
employed.

Since taped presentation and pacing would be forgone with BIB
spiralling, the cost-benefits of the trade off must be appraised. On
the one hand, poor readers, whether from disadvantaged minority
groups or not, perform somewhat better with aural as well as
printed presentation, while good readers appear not to be unduly
distracted on the average—although some good readers are
undoubtedly distracted. On the other hand, aural presentation is
expensive and requires extra equipment as well as some special
skills at the assessment session.

Much more important, aural presentation and pacing of
exercises is a procedure common to NAEP but rare indeed in other
educational measurement enterprises. No state or local assessments,
to our knowledge, have adopted aural presentation procedures,
and it is doubtful that any will. This renders NAEP procedures and
hence NAEP results noncomparable to the mainstream of
educational measurement practice. Innovative procedures such as
taped presentation are only of marginal value if they cannot
reasonably be used in state assessments or other testing programs.
Costs aside, the major trade off thus appears to be between
improved measurement validity for some students and
comparability of results of all students. Our conclusion, therefore, is that
aural presentation and pacing of exercises has questionable
cost-benefit while BIB spiralling has considerable and multiple
cost-benefits.

However, since the same exercise presented by printed page
alone will almost certainly have different properties than when
presented aurally as well, past NAEP results cannot be expected to
be comparable with those obtained in the redesigned NAEP if tape
presentation is eliminated. For this reason, statistical links must
be established to the past data in each area to maintain the
capability for trend analysis. Procedures for establishing these links
are discussed in the next section.

In summary, at the cost of increased printing and assembly
expenses and the aural presentation of exercises, BIB spiralling
simplifies administration, reduces sampling error, and provides the
ability both to determine the dimensionality of a subject area and
to develop scales using the most powerful available methodology.
It also results in data that can be more usefully organized on
public use tapes and more meaningfully described in reports and
in the public media.

Statistical Links to Past Data

As has just been emphasized, changes in the method of presenting
exercises may affect the probability of a correct response for some
students and hence the proportion of correct responses for various
groups. Thus, comparisons of proportions or of average percent
correct over the time interval spanning the method change could
be misleading because method differences are confounded with
educational trends during this period. Yet, not changing the
method of presentation commits NAEP to a perpetuation of
expensive procedures that restrict the comparability and utility of NAEP
results and hinder the implementation of powerful innovations
like BIB spiralling. The solution is to forge statistical links to the
past so as to permit translations from past data based on one
method to new data based on another. The capacity to make such
translations would effectively maintain the integrity of trend
analyses across the method change. This statistical linking, then,
is a means of both preserving what has been done in the past and
of moving responsibly into new methodology.

Equating samples. The proposed statistical link essentially
requires an equating study in which data are collected on some
student samples by the past method and on other student samples
by the new method during the same assessment wave in each
affected subject area. There are three types of data sets at issue in
this equating strategy:

Set A contains data from past assessments collected using taped 
presentation procedures. These are the data whose usefulness for
trend analyses we wish to preserve. 

Set B contains data from a future assessment wave collected
using precisely the same taped presentation methods as in set A.
Since the data in sets A and B were collected by the same method
but at different times, any differences between them are
attributable either to educational change or to sampling error. Since
sampling error can be estimated, so can the amount of
educational change.

Set C contains data collected in the same assessment wave as
set B but using the new methodology. Since the data in sets A and
C were collected not only at different times but by different
methods, any differences between them are attributable to method
differences as well as to educational change and sampling error.
More information is needed to disentangle these three
components and thereby render the data in set C comparable to that in set
A in the estimation of educational change.

The leverage for solving this problem comes from the data in
set B. If B and C are based on random samples from the same
population, then the differences between them are attributable only
to method differences and sampling error. Since they were
collected in the same assessment wave, they do not differ in
educational change. The data sets B and C can be used to estimate the effect
of method, thus disentangling method differences and
educational change in comparisons with set A.
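
The logic of the three data sets can be laid out for a single
exercise. In the sketch below the proportions correct are invented
numbers chosen only to show the arithmetic, sampling error is ignored,
and the simple additive adjustment is an illustrative assumption
rather than the report's equating procedure.

```python
# Sketch of the A/B/C logic for one exercise; the proportions correct
# below are invented numbers used only to show the arithmetic.

p_a = 0.62   # set A: past wave, old (taped) method
p_b = 0.65   # set B: current wave, old method
p_c = 0.60   # set C: current wave, new method

educational_change = p_b - p_a      # same method, different times
method_effect = p_c - p_b           # same wave, different methods
naive_change = p_c - p_a            # confounds the two components

# Put set C on the old-method footing before comparing it with set A.
adjusted_change = (p_c - method_effect) - p_a
print(round(educational_change, 2), round(method_effect, 2),
      round(naive_change, 2), round(adjusted_change, 2))
```

In this made-up case the naive comparison of C with A would show a
decline, whereas removing the method effect estimated from B and C
recovers the same gain that B shows over A.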

Therefore, whenever a substantial change in data collection
methodology is introduced, the NAEP redesign entails an equating
study to estimate the effect of the method change. Essentially,
this involves collecting data by the old method as well as the new
for different random samples from the same population of youth.
Data need be collected by the old method for set B on only those
exercises from set C that are repeated from past assessments.
Data collected by the new method for set C should be based on a
full-sized sample of students so that sampling error is not
increased and so that set C is directly comparable to future data
collected by the new method. It is proposed that set B be based on
half-sized samples, however, which our calculations indicate
should be sufficient for equating purposes.

Equating methods. Composite variables comprising several
exercises can be easily and straightforwardly equated using standard
methods, such as equipercentile equating, provided that some of
the blocks used in spiralling for set C are constrained so that
packages in set B can be composed of sets of those blocks. More
powerful equating methods using item parameters from IRT
scaling are applicable only if the method effect is small. If the method
of presentation affects low scorers more than high scorers, then
the requirement that the logistic function (relating the
probability of a correct response to proficiency or ability) should be the
same in both groups would clearly be violated. Equipercentile
equating would avoid this anomaly and provide a simple and clear
comparison of composite scores obtained by the two methods.
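
Equipercentile equating, mentioned above as a standard method, maps a
score obtained under one method to the score with the same percentile
rank under the other. The following is a deliberately crude sketch
with no smoothing; the data are invented and the helper names are
hypothetical, so it should be read as an outline of the idea rather
than an operational procedure.

```python
# Crude equipercentile equating sketch (no smoothing, invented data).
from bisect import bisect_left, bisect_right


def percentile_rank(scores, x):
    """Mid-percentile rank of x in sorted `scores` (a value in 0..1)."""
    below = bisect_left(scores, x)
    at = bisect_right(scores, x) - below
    return (below + 0.5 * at) / len(scores)


def quantile(scores, p):
    """Score at cumulative proportion p, by linear interpolation."""
    pos = p * (len(scores) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(scores) - 1)
    return scores[lo] + (pos - lo) * (scores[hi] - scores[lo])


def equate(x, new_method_scores, old_method_scores):
    """Old-method score with the same percentile rank as x."""
    new_sorted = sorted(new_method_scores)
    old_sorted = sorted(old_method_scores)
    return quantile(old_sorted, percentile_rank(new_sorted, x))


# Invented data: old-method composites run 5 points higher throughout.
set_c = list(range(0, 101))             # new method
set_b = [s + 5 for s in set_c]          # old method
print(equate(50, set_c, set_b))
```

Because the mapping is built from the two score distributions
directly, it does not require the method effect to be the same for
low and high scorers.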



Equating of single exercises is less straightforward and requires
some psychometric development. At a minimum, the differences
in response proportions due to changing methods of presentation
and to sampling error would be described; beyond that, a
nonlinear function fit to the proportions found by the two
methods could be used for purposes of translation and adjustment.

This equating-sample approach also has the advantage of
protecting against unforeseen problems. If the change in
presentation radically and massively affects the results, then the
equating sample B is available for comparison with the past and
continuity is maintained, albeit with a larger sampling error
because of the smaller sample size. In this case, a decision would
need to be made as to which method to use in the future. If it
were decided to continue the former method, then full-sized
samples would be collected by that method in future waves. In the
more likely case of deciding in favor of the new method—because
of cost-effectiveness, analytic power, and comparability of results
to other assessment programs—then trend data would be plotted
discontinuously. The earlier data would be presented along a time
line ending at the point corresponding to the equating sample,
and the later data would be plotted along the same time line but
beginning with a different value based on set C.

In the proposed NAEP redesign, only reading will be spiralled
during the initial wave; writing and citizenship/social studies
will be administered by the past matrix sampling procedures with
taped presentation and pacing. Since in the past, reading exercises
have not been aurally presented by tape but have been paced by
tape, an equating sample is proposed to appraise the effect of
changing from paced to unpaced administration. Because the
method effect of pacing is likely to be small, at least in
comparison with the effect of aural presentation, IRT equating may be
applicable; hence, it may be possible to represent time trends on a
common performance scale spanning past and future reading
data. In future assessment waves, spiralling is contemplated for
all subject areas, and an equating study will be undertaken for
each area as it is introduced.






Analysis Design Features 



The introduction of balanced incomplete block spiralling into
data collection has profound implications for data analysis. To
begin with, it makes possible the computation of covariances
among exercises within a subject area and, in future assessment
waves, across subject areas as well. In addition, it facilitates the
application of IRT scaling to exercises in different packages,
thereby yielding scales that span a subject area and, ultimately, scales
with a common meaning that span age levels and time periods as
well. The integrative properties of IRT scaling in turn have
powerful implications for trend analysis and for correlational and
"causal" or path analysis of relationships between performance
scales and background, attitudinal, and program variables.

Covariance Analysis

Since BIB spiralling assures that each pair of exercises is responded
to by a specified number of individuals, a covariance matrix can
be computed among all of the exercises in a subject area and, in
future assessment waves, between exercises in one area and those
in other areas assessed at the same time. In light of this latter
capability, it would make sense to select subject areas for a
particular assessment wave that are mutually facilitative, like
science and mathematics, so that the transfer relationships of
knowledge and skill can be appraised.
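
As a minimal sketch of the computation described above, the following
Python fragment estimates each covariance from only those respondents
who were administered both exercises of a pair. The function name, the
data layout, and the toy responses are illustrative assumptions, not
part of the NAEP design or software.

```python
from itertools import combinations

def pairwise_covariances(responses):
    """Covariance for each exercise pair from respondents who took both.

    `responses` is a list of dicts mapping exercise id -> 0/1 score; an
    exercise missing from a dict was not administered to that respondent.
    Returns {(ex_i, ex_j): (covariance, number_of_joint_respondents)}.
    """
    exercises = sorted({ex for person in responses for ex in person})
    result = {}
    for i, j in combinations(exercises, 2):
        pairs = [(p[i], p[j]) for p in responses if i in p and j in p]
        n = len(pairs)
        if n < 2:
            continue  # pair observed together fewer than twice
        mean_i = sum(x for x, _ in pairs) / n
        mean_j = sum(y for _, y in pairs) / n
        cov = sum((x - mean_i) * (y - mean_j) for x, y in pairs) / (n - 1)
        result[(i, j)] = (cov, n)
    return result

# Hypothetical toy data: three packages, each pairing two of three exercises.
toy = [
    {"A": 1, "B": 1}, {"A": 0, "B": 0}, {"A": 1, "B": 0}, {"A": 0, "B": 0},
    {"A": 1, "C": 1}, {"A": 0, "C": 1}, {"A": 1, "C": 1}, {"A": 0, "C": 0},
    {"B": 1, "C": 1}, {"B": 0, "C": 0}, {"B": 1, "C": 1}, {"B": 0, "C": 1},
]
covs = pairwise_covariances(toy)
for pair, (cov, n) in sorted(covs.items()):
    print(pair, round(cov, 3), n)
```

Because each element rests on a different subset of respondents, the
function also returns the joint sample size for every pair.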

The availability of covariances among exercises provides a number
of immediate benefits. First, it contributes to construct
validation (Cronbach, 1971; Messick, 1975, 1980) in that the
coherence of exercises designed to measure the same objectives can
be empirically evaluated, as can the degree to which an exercise
relates to other objectives for which it was not intended. It is
possible, for example, that a graph-interpretation problem in social
studies is more closely related to mathematics exercises than to
other types of social studies exercises. From this discriminant
aspect of construct validity, a second benefit of covariances is
obtained: namely, undesired method variance can be detected and
corrected. Thus, by identifying exercises that assess the same
dimensions or objectives regardless of exercise format, the
generalizability of interpretations becomes empirically grounded.
This is not to imply that graph interpretation should be excluded
from the assessment of social studies, but rather that it should not
be combined with social studies exercises that measure a different
dimension of knowledge or skill. Contrariwise, it does suggest
that the content coverage of graph problems in mathematics
could be enriched by inclusion of social studies material. In any
event, the decision about what kinds of exercises to include in the
assessment must be based both on expert judgment about relevance
and coverage and on demonstrated response consistencies
in student performance (Loevinger, 1957; Me...).

A third benefit of covariances is economy of measurement. By
empirically grouping sets of exercises that reliably assess a
common dimension or objective, composite scores can be used, which
entail smaller sampling errors. Preliminary calculations indicate
that, by going from one exercise to a composite of ten exercises,
sampling error is cut roughly in half but that further reductions in
sampling error diminish as the number of exercises in the composite
increases. With covariances available, item analysis procedures
could be used to refine large composites to optimal or
cost-effective levels of reliability and sampling efficiency.
In short, covariances provide an empirically-grounded conceptual
basis for defining meaningful scales and scores. This will
move NAEP from the level of statistical description of performance
on single exercises or unverified judgmental aggregations of
exercises to the level of measurement.
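
The diminishing return from longer composites can be illustrated under
a simplifying model that is our assumption, not the report's actual
calculation: all exercises have equal variance and a common
intercorrelation rho, and every respondent takes all k exercises.
The variance of the mean of k such exercises is then proportional to
(1 + (k - 1) rho) / k. The value rho = 1/6 is chosen only because it
reproduces the halving at ten exercises mentioned above.

```python
import math

def relative_sampling_error(k, rho):
    """Sampling error of a k-exercise composite relative to one exercise.

    Assumes (hypothetically) equal exercise variances and a common
    inter-exercise correlation `rho`.
    """
    return math.sqrt((1 + (k - 1) * rho) / k)

rho = 1 / 6  # assumed value, chosen so that ten exercises halve the error
for k in (1, 2, 5, 10, 20, 50):
    print(k, round(relative_sampling_error(k, rho), 3))

# Under this model the ratio can never fall below sqrt(rho).
floor = math.sqrt(rho)
print("limit:", round(floor, 3))
```

Under these assumptions the ratio approaches the square root of rho,
about 0.41, however many exercises are added, so little is gained
beyond the first ten.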

The structure of educational achievement. Moreover, the entire
matrix of intercovariances, or selected submatrices, can be
analyzed by such multivariate methods as metric and nonmetric
factor analysis and multidimensional scaling to ascertain the
dimensional structure of performance in the domain. In this
connection, it should be noted that the covariance matrix generated
via BIB spiralling differs from the usual covariance matrix in that
its elements are based on different random samples of individuals.
This means that the overall matrix, because of sampling and
measurement errors, may not be consistent with cross-products
generated from any single set of real scores. If this is the case,
principal components analysis of the covariance matrix would
yield at least one dimension having a negative sum of squares, a
mathematical inconsistency indicating that the matrix is not
appropriately analyzable by standard multivariate methods. However,
an effective solution is to estimate a population covariance
matrix, which will always be consistent and hence analyzable by
standard methods (B. Wingersky, 1982). Therefore, covariance
matrices based on BIB spiralling will be tested for consistency and
adjusted accordingly, if necessary, before undertaking dimensional
analyses.
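
The consistency test can be sketched as follows. The check attempts a
Cholesky factorization, which succeeds only when no dimension has a
zero or negative sum of squares. The adjustment shown, shrinking the
off-diagonal elements toward zero, is a deliberately simple stand-in
and is not the population-covariance estimation cited above. The
function names and the example matrix are hypothetical.

```python
import math

def is_consistent(matrix):
    """Return True if a symmetric matrix is positive definite.

    Attempts a Cholesky factorization; a non-positive pivot means the
    matrix could not have come from any single set of real scores.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True

def shrink_until_consistent(matrix, step=0.05):
    """Shrink off-diagonal elements toward zero until the check passes."""
    n = len(matrix)
    factor = 1.0
    adjusted = [row[:] for row in matrix]
    while not is_consistent(adjusted) and factor > 0:
        factor = round(factor - step, 10)
        adjusted = [[matrix[i][j] if i == j else factor * matrix[i][j]
                     for j in range(n)] for i in range(n)]
    return adjusted, factor

# Hypothetical pairwise matrix: each element is plausible on its own,
# but the three together are mutually contradictory.
pairwise = [[1.0, 0.9, 0.9],
            [0.9, 1.0, -0.9],
            [0.9, -0.9, 1.0]]
print(is_consistent(pairwise))
adjusted, factor = shrink_until_consistent(pairwise)
print(is_consistent(adjusted), factor)
```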

Another technical difficulty warrants further comment. In binary
response data of the type obtained with exercises scored correct
or incorrect, the covariances are distorted by curvilinearities
in the relationship between exercise responses and the underlying
performance dimension. If the exercises vary only moderately in
difficulty level, this problem is handled by using tetrachoric
correlations, especially if they are corrected for the effects of
guessing (Carroll, 1961). But if the exercises differ widely in
difficulty, it may be necessary to use alternative approaches such
as nonlinear factor analysis (McDonald, 1983) or methods that
attempt to fit the factor model directly to the binary data
(Hicker, 1983). Since this problem may be effectively finessed by
factor analyzing not item scores but composite scores for small
exercise clusters, this approach will be applied as well, using
both empirically-verified rational composites of exercises and
those derived by homogeneous-cluster keying.
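
To show why the choice of coefficient matters, the sketch below
compares the ordinary product-moment (phi) coefficient with the
classical cosine-pi approximation to the tetrachoric correlation for
an easy exercise paired with a hard one. The approximation is
substituted here for a full bivariate-normal estimate, no correction
for guessing is applied, and the cell counts are hypothetical.

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    a = both correct, b = first correct only, c = second correct only,
    d = both incorrect (cell counts of the 2 x 2 table). Tables with an
    empty cell are mapped to the limiting values +1 or -1.
    """
    if b == 0 or c == 0:
        return 1.0
    if a == 0 or d == 0:
        return -1.0
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))

def phi(a, b, c, d):
    """Ordinary product-moment (phi) correlation for the same table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: first exercise passed by 90 percent of 1,000
# respondents, second exercise passed by 10 percent.
table = (95, 805, 5, 95)
print("phi:", round(phi(*table), 3))
print("tetrachoric (approx.):", round(tetrachoric_cos_pi(*table), 3))
```

With difficulties this far apart the phi coefficient is held close to
zero even though the approximate tetrachoric value is moderate.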

Appropriate factor analyses of covariance matrices among exercises
will be employed in the NAEP redesign to ascertain the
dimensional structure of each subject area. The performance
dimensions isolated will be compared with the objectives specified
in exercise construction to identify any commonalities.
Those dimensions that cut across the original objectives will be
carefully examined to see if process interpretations can be educed
suggestive of new, more process-oriented objectives or of
higher-order skills. Depending on the outcome, the factor analysis
may thus yield dimensional scores for existing objectives as well as
scores for unanticipated dimensions that cut across the existing
objectives. In any event, the analysis will illuminate the structure
of performance in the domain, which should have important
implications both for instruction and future measurement.

Group differences in structure. The issue of fairness in
measurement impels us to inquire whether performance dimensions
have the same meaning and are measured with the same precision
in different population groups. This issue of population
generalizability will be addressed for separate dimensions by IRT
scaling in the next section. Here, we propose the application of
confirmatory factor analysis of covariance structures in different
groups of the same age to see if the same number of dimensions
emerge in each group and if they are interrelated in the same way
(Jöreskog & Sörbom, 1979). This method will be applied to the
comparison of male and female groups as well as to black and
white groups at each age level and, ultimately, to other groups of
special interest.

However, when the data are broken down by sex or by race, for
example, the covariances obtained for minority groups through BIB
spiralling may be based on small samples, although oversampling
may obviate this difficulty in some instances. Hence, in some
circumstances it may first prove necessary to use Bayesian missing
data techniques to adjust the sparse data by capitalizing on
knowledge of the total covariance matrix and the [...]
group covariance matrix based on sizable samples (Dempster,
Laird, & Rubin, 1977).

It is important to ascertain whether the covariance or factor
structure in different groups is the same or not, because the
interpretation of group differences in mean level of performance
depends upon it. Indeed, multivariate statistical tests on means
assume an invariant covariance structure. Once similarity of the
underlying factor pattern in different groups is established,
however, the interpretation of mean differences becomes legitimate
in the sense that there is supporting empirical evidence that they
reflect discrepancies along the same dimensions. If only some of
the factors are invariant across groups while others appear to be
group specific, then comparisons of group means on the invariant
factors would be reasonable. [...] factor structure
might be less benign, however, in [...] on the interpretation
of mean differences. If the same factor structures are found to
hold in the different groups, the equality of measurement precision
across groups may also be tested by this confirmatory factor
model (Jöreskog & Sörbom, 1979; Rock, Werts, & Grandy, 1981).

Age differences in structure. Similarity or differences in
covariance structures in different age groups may also be analyzed
in the same manner by this confirmatory factor model. Of particular
concern in age-group comparison is the possibility of developmental
trends not only in mean level of performance but also in the
degree of differentiation and integration of the skill dimensions at
different age levels. There are numerous theories of human
development supported by considerable empirical evidence that an
individual's cognitive skills and achievements become more
differentiated over time (e.g., Kagan & Kogan, 1970; Guilford, 1967).
This would in turn be reflected in differences in the factor
intercorrelations among these dimensions at different age levels.
Using confirmatory factor analysis, we can address this possibility
in each subject area by testing for differences in the factor
variances and intercovariances across the three age groups of 9-,
13-, and 17-year olds.

Since we are also concerned about whether age-related differences
in factor differentiation occur in the same way for all sex
and race groups, similar age-group comparisons will be conducted,
if the resulting sample sizes permit, separately for male
and female and for black and white groups. Again, any obtained
age-group differences in the number and nature of underlying
factors will have critical implications for the interpretation of
mean differences between the age groups, because that would imply
that the same dimensions are not being measured or are not being
measured in the same way at different ages.

Scaling by Item Response Theory

Item response theory (IRT) defines the probability of answering an
exercise correctly as a mathematical function of ability level or
skill. The particular mathematical function most widely used,
the logistic function, has one parameter for each individual
(namely, ability level) and from one to three parameters
characterizing each exercise (Lord, 1980a; Lord & Novick, 1968). The
item parameters reflect difficulty level, discriminating power,
and likelihood of guessing. The three-parameter model will be
emphasized here because the one- and two-parameter versions do
not adequately cope with the realities of exercise variation.
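
For concreteness, the three-parameter logistic function can be written
as a short Python sketch. The symbols a, b, and c stand for
discriminating power, difficulty, and the success level of very
unskilled respondents. The scaling constant D = 1.7 is a common
convention that we assume here, and the parameter values are
hypothetical.

```python
import math

D = 1.7  # conventional scaling constant; an assumption, not stated in the text

def p_correct(theta, a, b, c):
    """Three-parameter logistic item response function.

    theta: skill of the respondent; a: discriminating power;
    b: difficulty; c: success level of very unskilled respondents.
    """
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Hypothetical four-choice exercise.
a, b, c = 1.2, 0.5, 0.25
for theta in (-3, -1, 0.5, 2, 4):
    print(theta, round(p_correct(theta, a, b, c), 3))
```

Setting c = 0 gives the two-parameter version, and additionally
holding a equal across exercises gives the one-parameter version.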

IRT methods are appropriate for unidimensional areas or subareas
in which the exercises are scored right, wrong, or no
response. In the 1983-84 NAEP assessment, reading is the only area
for which IRT methods will be fully used, although subareas of
citizenship/social studies and possibly multiple-choice writing
items will also be analyzed. In subsequent years, IRT scaling will
be used for mathematics, science, and other appropriate areas.
The possibility of using IRT models for exercises having other
scoring formats, such as those scored on a scale from 0 to 10, will
also be investigated (e.g., Samejima, 1972, 1973, 1974). The
following description of the rationale and procedures for data
collection and analysis will typify IRT methods to be used in areas
having dichotomously-scored exercises, such as reading and
mathematics.



Individual- versus group-based IRT scaling. In the proposed NAEP
data analyses, the IRT model to be employed will fit the responses
of individuals, not some group mean of individuals. Although IRT
models defined at the level of groups, such as schools or
demographic subgroups, have been proposed (Bock, Mislevy, &
Woodson, 1982), it seems hardly plausible to assume that subgroup
mean performance has a true functional relationship to mean
level of skill in the subgroup.

From this standpoint, such group IRT models seem fundamentally
flawed at a theoretical level, as may be seen from the following
example. Figure 1 shows a typical item response function
representing the performance of individual respondents on a
given exercise. The four crosses mark the mean performance
levels on this exercise of four different hypothetical schools or
subgroups. The students in the first subgroup or school (lowest
cross) all have skill levels that are tightly distributed about -1,
and thus about 20 percent of these students will answer the
exercise correctly. In the second group, the range of skill is from
-2 to 0, and some 35 percent of the students answer correctly. In
the third group, the range of skill extends from -3 to 1, and about
50 percent answer correctly. In the fourth group, the range is from
-1 to 0, and about 35 percent answer correctly. Although the
example is an extreme one, it clearly demonstrates that mean
subgroup performance, whether at the level of schools or of
demographic categories such as those in the sampling design, cannot
be expected to have a true functional relationship to mean level of
skill. Thus, such group-based models do not fulfill the fundamental
requirement of IRT methodology, which is that the probability
of answering correctly be a mathematical function of ability level
or skill.
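
The argument can be checked numerically with a small sketch. Three
hypothetical groups share the same mean skill of -1 but differ in
spread. Skill is assumed to be uniformly distributed within each
range, and the item parameters are invented, so the percentages are
not those plotted in Figure 1.

```python
import math

def p_correct(theta, a=1.5, b=1.0, c=0.15, D=1.7):
    """3PL response function with hypothetical item parameters."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def group_percent_correct(low, high, steps=2000):
    """Mean P(correct) for skill spread uniformly over [low, high]."""
    width = (high - low) / steps
    total = sum(p_correct(low + (k + 0.5) * width) for k in range(steps))
    return total / steps

# All three groups have mean skill -1; only the spread differs.
groups = {"tight": (-1.1, -0.9), "medium": (-2.0, 0.0), "wide": (-3.0, 1.0)}
results = {name: group_percent_correct(lo, hi) for name, (lo, hi) in groups.items()}
for name, value in results.items():
    print(name, round(value, 3))
```

Because the response function is curved, the group proportion correct
rises with the spread of skill even though mean skill is unchanged,
so no single function of mean skill can reproduce all three values.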

Dimensionality. Since IRT models, whether individual- or
group-based, are applicable only to unidimensional sets of
exercises, the availability of covariances will be capitalized on to
meet this requirement. Factor analyses will be carried out to
determine how the exercises in a skill area can be subdivided into
subareas that are roughly unidimensional. In the mathematics
area, for example, exercises may be classified into the following
categories: calculation, story problems, geometry, definitions,
measurement. In one approach, a group factor will be extracted
for each of these subareas and the residuals examined to see if
there are other significant group factors needed to account for the
item intercorrelations. The correlation of each group factor with
the general factor for all the exercises will also be computed. We
will then decide whether a general factor may be substituted for
all or some of the group factors without serious loss.

Figure 1

School Means Plotted in Relation to Exercise Response Function

[Figure not recoverable from the source scan. Horizontal axis:
Level of Skill.]

In this way we will be able to appraise whether all or nearly all
reading exercises (or mathematics or science exercises) can be
analyzed together in IRT work. If a few exercises do not fit this
procedure, they will be removed from the IRT analysis and analyzed
by conventional methods such as proportion-correct. If the
exercises fall into two or more subareas that cannot be merged,
each such subarea will be treated separately for IRT analysis,
provided it contains enough items for this purpose.

Assessment. With BIB spiralling of exercises, IRT methods may
be applied to exercises appearing in different packages; indeed, if
unidimensionality is satisfied, to all of the exercises in a subject
area. For example, Table 3 shows a balanced lattice design
allocating 25 blocks of different exercises among 30 subgroups of
students within a given age group. If there are 12,000 students
altogether, then each exercise is taken by 2,400 people. If there
are 250 exercises altogether, each block contains ten exercises.
Each student answers five blocks or 50 exercises. The existing
computer program LOGIST (M.S. Wingersky, 1982; M.S. Wingersky,
Barton, & Lord, 1982), which we propose to use to estimate each
individual's level of skill or proficiency, is designed to handle
sparse data matrices such as Table 3.
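
The counts in this example can be verified with a short sketch that
builds one balanced lattice of the stated size and tallies it. The
construction used (the lines of a 5 x 5 grid with arithmetic modulo 5)
is a standard one and is our assumption; it is not necessarily the
layout printed in Table 3, although any balanced design with these
dimensions yields the same counts. All names are illustrative.

```python
from itertools import combinations
from collections import Counter

Q = 5  # blocks per package; Q * Q = 25 blocks of exercises

def balanced_lattice(q=Q):
    """Return q*(q+1) packages, each a list of q block numbers (1..q*q).

    Blocks are the points (x, y) of a q x q grid; packages are its lines.
    Requires q to be prime.
    """
    def label(x, y):
        return x * q + y + 1
    packages = []
    for slope in range(q):
        for intercept in range(q):
            packages.append([label(x, (slope * x + intercept) % q) for x in range(q)])
    for x in range(q):  # the q "vertical" lines
        packages.append([label(x, y) for y in range(q)])
    return packages

packages = balanced_lattice()
pair_counts = Counter(pair for pk in packages for pair in combinations(sorted(pk), 2))
block_counts = Counter(block for pk in packages for block in pk)

students = 12_000
per_package = students // len(packages)        # 400 students per subgroup
per_block = per_package * block_counts[1]      # 2,400 students per exercise
per_pair = per_package * pair_counts[(1, 2)]   # 400 students per pair of blocks
responses = students * Q * 10                  # 600,000 with ten exercises per block
print(len(packages), per_package, per_block, per_pair, responses)
```

Every pair of blocks appears together in exactly one package, which is
what permits a covariance to be computed for every pair of exercises.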

There is, of course, no intention of reporting skill levels for
individuals. Rather, the assessment of groups, which is the
ultimate purpose of NAEP, will be accomplished by the pooling of
individual assessments. This assessment of the individual is given
by the maximum likelihood estimate of his or her level of skill
under IRT assumptions (Lord, 1980a). In NAEP applications, each
individual term in the maximum likelihood equations can be
weighted by the sampling weight assigned to the individual in the
sampling frame. One efficiency of LOGIST is exemplified by noting
that the computer time used is proportional to the amount of data
(to 2,400 x 250 = 12,000 x 50 = 600,000 responses in the
illustrative example), not proportional to both the number of
exercises and the number of people simultaneously (not to 250 x
12,000 = 3,000,000).
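
A much simplified sketch of maximum likelihood estimation of one
respondent's skill follows. Unlike LOGIST, which estimates exercise
and respondent parameters together, it treats the exercise parameters
as known and searches a grid of skill values; both simplifications,
the constant D = 1.7, and all parameter values are our assumptions.
Only the exercises actually administered enter the likelihood, so the
work grows with the number of responses rather than with the full
exercise-by-person matrix.

```python
import math

D = 1.7  # conventional scaling constant (assumed)

def p_correct(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of one respondent's scored answers.

    `responses` maps exercise id -> 1 (right) or 0 (wrong) and holds only
    the exercises actually administered, so sparse data need no padding.
    """
    total = 0.0
    for ex, score in responses.items():
        p = p_correct(theta, *items[ex])
        total += math.log(p) if score == 1 else math.log(1 - p)
    return total

def ml_skill(responses, items, low=-4.0, high=4.0, step=0.01):
    """Grid-search maximum likelihood estimate of skill on [low, high]."""
    n = int(round((high - low) / step))
    grid = [low + k * step for k in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical item parameters (a, b, c) for a small pool of exercises.
items = {
    "e1": (1.0, -1.5, 0.20), "e2": (1.2, -0.5, 0.25), "e3": (0.8, 0.0, 0.20),
    "e4": (1.5, 0.5, 0.25), "e5": (1.1, 1.0, 0.20), "e6": (0.9, 2.0, 0.25),
}
weak = {"e1": 1, "e2": 0, "e3": 0, "e4": 0}      # took exercises e1-e4
strong = {"e3": 1, "e4": 1, "e5": 1, "e6": 0}    # took exercises e3-e6
print(round(ml_skill(weak, items), 2), round(ml_skill(strong, items), 2))
```

The two respondents took different, overlapping sets of exercises, yet
both estimates fall on the same skill scale.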

LOGIST uses a three-parameter logistic model for the data. Its
final output consists of one number for each individual that
assesses skill level and three numbers that describe each exercise:
one for the difficulty of the exercise, another for the extent to
which success on the exercise is related to the overall assessment
in the area scaled, and a third number representing the proportion
of successes on the exercise among very unskilled individuals.
This last number, which is often ignored or misused, should not
be neglected during NAEP assessment.

The success level for unskilled individuals, denoted for
exercise i by c_i, is necessarily nonzero for multiple-choice items,
which can be answered correctly by guessing. The usual
oversimplifications assume that all c_i = 0 (one-parameter or Rasch
models and two-parameter models) or that all c_i are equal across
exercises. It is also commonly but mistakenly asserted that c_i
cannot be accurately estimated. Figure 2 is presented to contradict
all these views. It shows c_i estimated by LOGIST from two
different data sets for the same exercises. The exercises plotted
are all those for which b_i - 2/a_i > -2, where b_i is the IRT
difficulty parameter and a_i is the discriminating power. It is
clear from Figure 2 that the c_i can be reliably estimated for
exercises that are discriminating and not too easy.

Table 3

Balanced Lattice Design Allocating Exercises to People

[Table body not recoverable from the source scan. Columns: blocks
of exercises, numbered 1 to 25. Rows: subgroups of students,
numbered 1 to 30. The cell marks showing which blocks each subgroup
receives are lost.]

Suppose an educational statistician assumes that all c_i = .2 for
a large set of NAEP exercises. The data will very likely contradict
this assumption. For example, the statistician will later find that
one exercise was answered correctly by only 11 percent of all
individuals in a certain large socioeconomic subgroup.

Figure 2

Comparison of Estimated C Values
SAT ESA04, March '82
Solo vs Concurrent

[Figure not recoverable from the source scan. Axis label recovered:
C - Concurrent.]

LOGIST also affords a solution to a problem in the current mode
of NAEP reporting. When NAEP reports that 30 percent of individuals
in a certain subgroup answered a particular four-choice exercise
correctly, it is difficult to interpret this number. If individuals
who had no idea of the correct answer guessed at random on
the exercise, the 30 percent has a different meaning than if all
such individuals either omitted the exercise or indicated they did
not know the answer. The recent NAEP practice of reporting average
percent correct across exercises judged to represent a particular
objective or achievement area simply exacerbates the problem.



Figure 3

New Jersey Basic Skills Results for
Six Reading Comprehension Exercises

In IRT work, it is seriously incorrect to treat omitted or "do not
know" responses the same as wrong responses. It is also incorrect
to treat omitted or "do not know" responses as if the corresponding
exercises had not been administered. Currently, LOGIST is the
only IRT program to our knowledge that treats such data in a
reasonably appropriate manner (Lord, 1974).

Figure 4

New Jersey Basic Skills Results for
Six Mathematics Computation Exercises




Checking IRT model fit. Figures 3 and 4 show the type of plot
that has been used in all IRT operational work at Educational
Testing Service for the past three years in order to provide a
visual check on how well the IRT model is able to fit the data. The
smooth curves in each figure are estimated response functions for
six consecutive four-choice exercises in the New Jersey College
Basic Skills Placement Test. The horizontal axis in each plot
shows the skill of the respondent; the vertical axis shows the
probability of a correct answer.

Respondents are divided into 15 class intervals according to
their estimated level of skill. The area of each plotted rectangle or
square is proportional to the number of examinees in the
corresponding class interval. The center of the rectangle indicates
the observed proportion of respondents in the interval who actually
answered the exercise correctly. The vertical line in each interval
extends two binomial standard errors above and below the
theoretical curve.

Figure 3 presents results for six reading comprehension exercises.
The data came from a sparse matrix such as that in Table 3.
The number of examinees for these items ranged from 2,400 to
9,600. Figure 4 shows results for six mathematics computation
exercises. Each plot represents the results for 21,000 to 24,000
examinees.

Examination of these plots convinces us that (1) unskilled
examinees have better than zero chance of success, (2) their chance
of success varies sharply from exercise to exercise, (3) their
chance of success on the more difficult exercises can be accurately
estimated, and (4) the slopes of the curves (the discriminating
power of the exercise) vary sharply from exercise to exercise.

Figure 5

Distributions of Skill in Three Subgroups Together with Expected
Performance Levels on Various Benchmark Exercises

[Figure not recoverable from the source scan. Panel labels
recovered: Group B, Group C.]

A chi square comparing theoretical and observed frequencies is
also computed for each plot. It is helpful to list the contribution
of each class interval to this chi square. Although this procedure,
like other available procedures (Hambleton, 1982), does not permit
an exact test of statistical significance, it has nevertheless
proved helpful in locating ambiguous or other anomalous exercises
that clearly do not fit the IRT model. Such exercises can be
studied by conventional methods based on proportions of correct
answers.
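
The fit check described above can be sketched for a single exercise.
The sketch groups respondents into 15 class intervals of estimated
skill, compares the observed proportion correct in each interval with
the model curve, reports a band of two binomial standard errors, and
lists each interval's contribution to the chi square. Evaluating the
curve at the interval midpoint, the simulated responses, the item
parameters, and all names are our assumptions.

```python
import math
import random

def p_correct(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def fit_table(thetas, scores, item, n_intervals=15, low=-3.0, high=3.0):
    """Per-interval fit summary for one exercise.

    Returns (midpoint, n, observed, expected, band, chi_square_part) rows,
    where band is two binomial standard errors around the model curve.
    """
    width = (high - low) / n_intervals
    counts = [0] * n_intervals
    rights = [0] * n_intervals
    for theta, score in zip(thetas, scores):
        # Skill values outside [low, high] fall into the end intervals.
        k = min(max(int((theta - low) / width), 0), n_intervals - 1)
        counts[k] += 1
        rights[k] += score
    rows = []
    for k in range(n_intervals):
        n = counts[k]
        if n == 0:
            continue
        mid = low + (k + 0.5) * width
        expected = p_correct(mid, *item)
        observed = rights[k] / n
        se = math.sqrt(expected * (1 - expected) / n)
        chi_part = n * (observed - expected) ** 2 / (expected * (1 - expected))
        rows.append((mid, n, observed, expected, 2 * se, chi_part))
    return rows

# Simulated data for a hypothetical exercise with parameters (a, b, c).
random.seed(0)
item = (1.2, 0.3, 0.25)
thetas = [random.gauss(0, 1) for _ in range(5000)]
scores = [1 if random.random() < p_correct(t, *item) else 0 for t in thetas]
rows = fit_table(thetas, scores, item)
chi_square = sum(row[5] for row in rows)
for mid, n, obs, exp, band, part in rows:
    flag = "" if abs(obs - exp) <= band else "  <-- outside band"
    print(f"{mid:5.1f} {n:5d} {obs:.3f} {exp:.3f} +/-{band:.3f} {part:6.2f}{flag}")
print("chi square:", round(chi_square, 2))
```

Intervals flagged as outside the band, or contributing heavily to the
chi square, point to the skill range in which an exercise misfits.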

Estimating group performance on a common scale. The main
purpose of the IRT analyses is to provide a common scale on which
performance can be compared across groups and subgroups,
whether tested at the same time or several years apart. IRT allows
us to estimate group performance for any group or subgroup, even
though all respondents did not take all the exercises in the NAEP
pool.

A technical report of results will contain many figures such as
Figure 5, showing the distribution of skill in various subgroups
together with expected performance levels on various benchmark
exercises. The vertical arrows mark the median and the first and
third quartiles in the distribution of skill for each specified group.
The figure can be read to give the proportion of correct answers on
each exercise expected for individuals at each quartile (or at any
other point) in each group. It can also be read to give the
proportion of individuals in any group who have less than some
specified probability of success on any given exercise. More
accurate information will also be given in numerical tables.

The actual text of the benchmark exercises will accompany
such figures and tables. Note that this provides a
criterion-referenced interpretation of the meaning of each numerical
level of skill: the skill score is interpreted in terms of expected
performance on typical benchmark exercises. Norm-referenced
interpretations are also provided by such figures and tables by
reference to the group distributions.

Appraising item bias. If an exercise has exactly the same item
response function in every group assessed, then individuals at any
given skill level will have exactly the same probability of getting
the exercise correct, regardless of their group membership. This
is true even though some groups may have a lower average skill
level than other groups. However, if an exercise has a different
item response function for one group than for another, then the
item is biased in some way.

If the item response function for one group is higher than that
for another at all levels of skill, then individuals in the first group
have a better chance of getting the exercise correct than
individuals of equal ability in the second group. A more complicated
form of bias occurs if the item response functions for two groups
cross, as is often found in practice, because then the exercise is
biased in favor of some members of each group but against other
members. If item bias is substantial, the exercise should be omitted
from the LOGIST run and studied by conventional methods, if at all.
These types of bias can be evaluated using IRT methods by
estimating item parameters separately for each group and comparing
the item response functions across groups (Lord, 1976, 1980a).
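
One hedged way to carry out such a comparison is sketched below. Given
parameter estimates for the same exercise from two groups, it
integrates the gap between the two response functions and reports
whether the curves cross. Summarizing the gap by signed and unsigned
areas is our choice of index, not one prescribed here; the parameter
values are hypothetical and are assumed to be on a common skill scale
already.

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def compare_response_functions(item_g1, item_g2, low=-3.0, high=3.0, steps=600):
    """Summarize the gap between two groups' response functions.

    Returns (signed_area, unsigned_area, curves_cross) over [low, high].
    Both parameter sets are assumed to be on the same skill scale already.
    """
    width = (high - low) / steps
    signed = unsigned = 0.0
    saw_positive = saw_negative = False
    for k in range(steps):
        theta = low + (k + 0.5) * width
        gap = p_correct(theta, *item_g1) - p_correct(theta, *item_g2)
        signed += gap * width
        unsigned += abs(gap) * width
        saw_positive = saw_positive or gap > 1e-9
        saw_negative = saw_negative or gap < -1e-9
    return signed, unsigned, saw_positive and saw_negative

# Hypothetical (a, b, c) estimates for one exercise in two groups.
uniform_bias = compare_response_functions((1.0, 0.0, 0.2), (1.0, 0.4, 0.2))
crossing_bias = compare_response_functions((1.5, 0.0, 0.2), (0.7, 0.0, 0.2))
no_bias = compare_response_functions((1.0, 0.0, 0.2), (1.0, 0.0, 0.2))
for name, result in (("uniform", uniform_bias), ("crossing", crossing_bias),
                     ("none", no_bias)):
    print(name, round(result[0], 3), round(result[1], 3), result[2])
```

In the crossing case the signed area is near zero while the unsigned
area is not, which is why both summaries are reported.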

Development of a common scale across age levels. Table 3
illustrates the assignment of exercises to individuals in the same
age group. Many exercises are given both to 9-year olds and to
13-year olds; many others are given both to 13-year olds and to
17-year olds. This design is indicated in Figure 6. Each row of
Figure 6 has a fine structure like that in Table 3.

The exercises in the top and bottom rows of Figure 6 are divided
into three categories: (1) those exercises that are common to two
age groups, (2) those that are similar in topic and in difficulty to
these common exercises, and (3) other exercises. The main LOGIST
run discussed above will not be limited to any single age group.
Rather, it will include all data in Figure 6 except for the exercises
marked "other" for 9- and 17-year olds; all exercises for 13-year
olds will be analyzed. This will place all age groups on the same
skill scale. After this has been done, each individual's estimated
skill level will be held fixed while the parameters describing the
"other" exercises are found by a further LOGIST run.

Measuring change across time. Exercises in NAEP reading
administrations prior to 1983-84 were administered in printed form,
with taped pacing but without the use of taped aural presentation.
If the effect of pacing proves minimal, the 1983-84 reading
scale can be and will be extended to all exercises administered in
past years. This could be done by the method described in the
previous section.

Figure 6

Assignment of Exercises Within and Across Age Groups

[Figure not recoverable from the source scan. Rows: 9-year olds,
13-year olds, 17-year olds. Segment labels recovered: OTHER,
SIMILAR, COMMON; COMMON, SIMILAR, OTHER.]

A preferable procedure will be to make separate LOGIST runs on
data from earlier NAEP administrations. Exercises common to an
earlier administration and a later administration will be used to
place all earlier results on the same scale as the 1983-84 results.
This will be accomplished by the computer program TBLT in current
use at ETS (Stocking & Lord, in press). This program finds the
linear scale transformation that places two sets of IRT parameters
on the same scale in such a way as to minimize a certain sum of
squared errors. The quantity minimized is the mean squared
difference between number right scores on the common items
predicted from the two sets of IRT parameters that are to be placed
on the same scale.
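
The criterion just described can be sketched in a few lines. The code
below is not the ETS program: it uses a coarse-to-fine grid search and
an evenly spaced set of skill values, both of which are our
assumptions, and it is run on synthetic parameters built from a known
transformation so that the recovered slope and intercept can be
checked.

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def expected_number_right(theta, items):
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

def rescale(items, slope, intercept):
    """Move (a, b, c) triples onto the scale theta_new = slope * theta + intercept."""
    return [(a / slope, slope * b + intercept, c) for a, b, c in items]

def linking_loss(slope, intercept, target_items, source_items, thetas):
    """Mean squared difference in predicted number right on the common items."""
    moved = rescale(source_items, slope, intercept)
    return sum((expected_number_right(t, target_items)
                - expected_number_right(t, moved)) ** 2 for t in thetas) / len(thetas)

def find_linear_link(target_items, source_items, thetas, rounds=4):
    """Coarse-to-fine grid search for the slope and intercept."""
    best = (1.0, 0.0)
    span_s, span_i = 0.75, 2.0
    for _ in range(rounds):
        candidates = [(best[0] + span_s * (i - 10) / 10,
                       best[1] + span_i * (j - 10) / 10)
                      for i in range(21) for j in range(21)]
        candidates = [(s, t) for s, t in candidates if s > 0.1]
        best = min(candidates, key=lambda st: linking_loss(
            st[0], st[1], target_items, source_items, thetas))
        span_s, span_i = span_s / 3, span_i / 3
    return best

# Hypothetical common items on the target (later) scale.
target_items = [(1.0, -1.0, 0.20), (1.3, -0.3, 0.25), (0.8, 0.2, 0.20),
                (1.1, 0.8, 0.25), (1.4, 1.5, 0.20)]
TRUE_SLOPE, TRUE_INTERCEPT = 1.2, -0.5
# The same items as they would appear on the earlier run's scale.
source_items = [(a * TRUE_SLOPE, (b - TRUE_INTERCEPT) / TRUE_SLOPE, c)
                for a, b, c in target_items]
thetas = [-3 + 0.5 * k for k in range(13)]
slope, intercept = find_linear_link(target_items, source_items, thetas)
print(round(slope, 2), round(intercept, 2))
```

Once the slope and intercept are found, the same transformation is
applied to the skill estimates from the earlier run.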

By the same method, future groups assessed in reading without
taped pacing can be compared on a common reading proficiency
scale with groups assessed in 1983-84. Furthermore, if the effect
of pacing proves to be minor, these future groups can also be
compared on a common reading scale with groups assessed in previous
NAEP administrations. Similar comparisons can be made for
mathematics and other areas, except that the use of aural tape
presentation before 1983 may impair attempted common-scale
comparisons extending backwards in time before 1983.

The power of IRT scaling. Among the considerable benefits of
IRT scaling for NAEP is the availability, for strictly analytical
purposes, of weighted composite scores for each individual on
unidimensional aspects of the subject area in which he or she was
assessed. This means that performance dimensions in each subject
area may be correlated both with each other and with background,
attitudinal, and program variables tied to these same students.
Furthermore, a variety of subgroups may be defined in
terms of these variables, such as bilingual versus monolingual,
large school versus small school, Title I participation versus none,
or science interest versus arts interest. The educational
performance of these constructed groups may then be compared, if the
resulting subgroup sizes are adequate, either simply or with
covariance controls for other variables. Moreover, when spiralling
occurs across subject areas, the correlational structure of
performance scales and their correlates may be addressed both within
and across subject fields.

Another benefit of IRT scaling is invariance both of item
parameters across respondent groups and of respondents' skill
levels across subsets of exercises. This means that each individu-
al's skill level may be estimated from any subset of exercises and
that exercises may be added to or retired from the assessment at any
point without affecting comparability of results. Furthermore,
since the skill scales are unbounded, they are not warped by floor
and ceiling effects in the way percentages and total scores are, so
they tend to be more linearly related to other quantitative vari-
ables. These advantages combined with those previously dis-
cussed—especially the capacity for both criterion-referenced and
norm-referenced interpretations and for linking overlapping sets
of exercises to common scales spanning subject area, popu-
lation subgroups, age levels, and time periods—make IRT scaling
not only ideal for NAEP purposes, but essential.
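The invariance property can be illustrated with a small sketch. It assumes a two-parameter logistic item response function, which is one common IRT form and not necessarily the model adopted for NAEP. The item parameters and the names `p_correct` and `estimate_skill` are hypothetical. Scores may be 0/1 responses or, as in this example, the expected proportions correct for respondents at a single skill level.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_skill(items, scores, iters=50):
    """Maximum-likelihood skill estimate by Newton-Raphson.

    items  -- list of (a, b) pairs: discrimination and difficulty
    scores -- one score per item (0/1, or a proportion correct)
    """
    theta = 0.0
    for _ in range(iters):
        probs = [p_correct(theta, a, b) for a, b in items]
        grad = sum(a * (s - p) for (a, _), s, p in zip(items, scores, probs))
        info = sum(a * a * p * (1.0 - p) for (a, _), p in zip(items, probs))
        step = grad / info
        theta += max(-1.0, min(1.0, step))  # damp large steps
        if abs(step) < 1e-10:
            break
    return theta

# Hypothetical calibrated exercises on a common scale.
easy_set = [(1.0, -1.5), (0.8, -0.5), (1.2, 0.0)]
hard_set = [(1.1, 0.5), (0.9, 1.0), (1.3, 2.0)]

true_skill = 0.7
# Expected proportions correct at that skill level on each subset.
easy_scores = [p_correct(true_skill, a, b) for a, b in easy_set]
hard_scores = [p_correct(true_skill, a, b) for a, b in hard_set]

print(round(estimate_skill(easy_set, easy_scores), 3))
print(round(estimate_skill(hard_set, hard_scores), 3))
```

Under these assumptions the two disjoint subsets return the same skill value, which is the sense in which exercises can be added or retired without disturbing the scale. With real 0/1 responses the two estimates would agree only up to sampling error.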

Analysis of Time Trends 

There are a variety of opportunities for studying time trends in 
the data gathered in the initial wave of the NAEP redesign in com-
bination with data from previous waves of NAEP. The availability
of trend data for the subject areas covered is summarized as 
follows: 

Reading:        70-71, 74-75, 79-80, 83-84.
Writing:        69-70, 73-74, 78-79, 83-84.
Citizenship:    69-70, 75-76, 81-82, 83-84.
Social Studies: 71-72, 75-76, 81-82, 83-84.

Thus there are four waves of data for each of the four subject areas 
to be assessed in 1983-84. The methods of trend analysis dis-
cussed below are applicable to time-structured data of this type
and hence may be employed with past waves of data in other sub-
ject areas or with future waves of data.

Two different levels or types of trend analyses are proposed.
The first is at the level of individual exercises, and the second is
at the level of scales or composites derived from the responses to
all of the exercises in a subject area or subarea. Both types of
trends will be analyzed. Exercises that are repeated in several
waves of data collection give us the opportunity to see how the
distribution of very specific knowledge or skill has changed over
the years encompassed by the data. Scaled or composite scores de-
rived from sets of exercises, which may or may not be repeated
entirely in several years, will allow a more aggregate picture of
the changes in the distribution of knowledge for each subject area
across the relevant time periods. Trend analysis at the exercise
level differs from that at the scale level in terms of both the appro-
priate questions to ask and the corresponding methods to apply.

Analysis at the exercise level. The question most appropriate
for this level of analysis is: "How does the proportion of students
who get the particular exercise correct vary over the years stud-
ied?" The main concern is to identify the overall trend across all
students of a given age and also to identify significant student
subpopulations exhibiting trends that differ from the overall pic-
ture. The overall trend is expressed by the "item x year" interac-
tion while major subpopulations in which the trends differ will
create a "subpopulation x item x year" interaction. These inter-
actions may be analyzed most powerfully using the modern sta-
tistical theory of multi-way contingency tables (Bishop, Fienberg,
& Holland, 1975).

By these procedures, one first forms a multi-way contingency
table having at least these three dimensions: performance on the
exercise (2 levels: right, wrong); year of data collection (4 levels);
and subpopulation membership (n levels). Examples of subpopu-
lations are sex, ethnicity, region of the country, urban-rural, and
so forth.

Strictly speaking, the dimensions ought to include those that
describe the sampling frames for each year. This permits one to
use the unweighted data and simplifies the sampling properties of
the relevant test statistics. In this framework, the overall trend in
the proportion correct is associated with the item x year interac-
tion as expressed in an appropriate log-linear model for the multi-
way contingency table. Equivalently, logistic regression methods
can be used to obtain the parameter estimates.
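As a rough illustration for a single exercise, the overall trend can be examined in the year x (right, wrong) table. The sketch below computes the likelihood-ratio statistic G2 for independence, together with the log-odds of a correct response in each wave, which is the scale on which a logistic regression would express the year effects. The counts are invented, the function names are hypothetical, and the sampling-frame dimensions mentioned in the text are omitted for brevity.

```python
import math

def g_squared(table):
    """Likelihood-ratio statistic for independence in a two-way table.

    table -- rows are assessment years, columns are (right, wrong) counts
    Returns (G2, degrees of freedom).
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = row_tot[i] * col_tot[j] / total
                g2 += 2.0 * obs * math.log(obs / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, df

def log_odds(table):
    """Log-odds of a correct response in each year (logit scale)."""
    return [math.log(right / wrong) for right, wrong in table]

# Hypothetical unweighted counts for one exercise in four waves.
counts = [(620, 380), (600, 400), (570, 430), (590, 410)]
g2, df = g_squared(counts)
print(round(g2, 2), df)
print([round(v, 3) for v in log_odds(counts)])
```

G2 would be referred to a chi-square distribution with the stated degrees of freedom; a large value indicates that the proportion correct changed across waves.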

Serious deviations from the overall trend for the given exercise 
may be determined by testing for subpopulation x item x year in-
teractions using log-linear models for the multi-way table. This
will result in identifying two classes of exercises that are repeated
over time. The first type will be those exercises for which the
time trend is fairly consistent across all major subpopulations.
The second type will be those exercises for which there are signi-
ficant differences in the time trends across subpopulations. The
use of modern contingency table methods allows these two types
of trends to be rigorously identified and distinguished from one
another.
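One standard way to carry out such a test is to fit the log-linear model containing all two-way terms by iterative proportional fitting and to compare the fitted with the observed counts. The sketch below does this for a subpopulation x year x response table. It is an illustration under assumed data, not the analysis actually run for NAEP; the counts and function names are hypothetical, and every two-way margin is assumed to be positive.

```python
import math

def fit_no_three_way(table, iters=500):
    """Fit the log-linear model with all two-way terms by
    iterative proportional fitting.

    table[g][y][r] -- counts by subpopulation g, year y, response r
    """
    G, Y, R = len(table), len(table[0]), len(table[0][0])
    fit = [[[1.0] * R for _ in range(Y)] for _ in range(G)]
    for _ in range(iters):
        # Match the subpopulation x year margin.
        for g in range(G):
            for y in range(Y):
                scale = sum(table[g][y]) / sum(fit[g][y])
                for r in range(R):
                    fit[g][y][r] *= scale
        # Match the subpopulation x response margin.
        for g in range(G):
            for r in range(R):
                obs = sum(table[g][y][r] for y in range(Y))
                cur = sum(fit[g][y][r] for y in range(Y))
                for y in range(Y):
                    fit[g][y][r] *= obs / cur
        # Match the year x response margin.
        for y in range(Y):
            for r in range(R):
                obs = sum(table[g][y][r] for g in range(G))
                cur = sum(fit[g][y][r] for g in range(G))
                for g in range(G):
                    fit[g][y][r] *= obs / cur
    return fit

def interaction_g2(table):
    """G2 and degrees of freedom for the three-way interaction."""
    fit = fit_no_three_way(table)
    g2 = 0.0
    for g, plane in enumerate(table):
        for y, row in enumerate(plane):
            for r, obs in enumerate(row):
                if obs > 0:
                    g2 += 2.0 * obs * math.log(obs / fit[g][y][r])
    G, Y, R = len(table), len(table[0]), len(table[0][0])
    return g2, (G - 1) * (Y - 1) * (R - 1)

# Hypothetical counts: two subpopulations, four waves, (right, wrong).
counts = [
    [[310, 190], [300, 200], [280, 220], [300, 200]],
    [[250, 250], [255, 245], [270, 230], [290, 210]],
]
print(interaction_g2(counts))
```

A small G2 relative to its degrees of freedom places the exercise in the first class (consistent trends); a large G2 places it in the second.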

In addition to the time trends just described, further analytical 
power is afforded when the number of years intervening between 
assessment waves matches the age difference of the samples
assessed. For example, the cohort of students assessed in 1979-80
at age 9 will be 13 years old in 1983-84. Similarly, 13-year olds in
1979-80 will be 17 years old in 1983-84. Exercises that are
repeated in these two waves of data collection and which are ad-
ministered to both 9- and 13-year olds or to 13- and 17-year olds
give us a double-barreled look at time trends. We can investigate
how a cohort, say 9-year olds in 1979-80, changed in their
responses to a repeated exercise when the cohort became 13 years
old in 1983-84, and we can compare these changes to that for
other cohorts. The methods of analysis
are similar to those described earlier.

Although such a linking of assessment intervals and age differ-
ences in the sample occurs only sporadically in past assessment
waves, appropriate matches do occur for writing (1969-70 and
1973-74), reading (1970-71 and 1974-75), social studies (1971-72
and 1975-76), science (1972-73 and 1976-77), art (1974-75 and
1978-79), and mathematics (1977-78 and 1981-82). If the sched-
ule outlined in Table / is adhered to, appropriate cohort matches
would be routine in the redesigned NAEP. In addition, using this
proposed schedule, a cohort match occurs immediately in reading
(1979-80 and 1983-84) and a full cohort cycle is achieved in
mathematics (1977-78, 1981-82, and 1985-86).
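The matching rule is simple enough to check mechanically. The sketch below takes the four subject-area schedules from the trend-data listing earlier in this section, keys each wave by the starting calendar year of its school year, and reports the pairs of waves separated by the four-year gap between assessed ages. The names `WAVES` and `cohort_matches` are illustrative only.

```python
# Assessment waves by subject, taken from the schedule listed above
# and keyed by the starting calendar year of each school year.
WAVES = {
    "reading": [1970, 1974, 1979, 1983],
    "writing": [1969, 1973, 1978, 1983],
    "citizenship": [1969, 1975, 1981, 1983],
    "social studies": [1971, 1975, 1981, 1983],
}

AGE_GAP = 4  # years between the assessed ages 9, 13, and 17

def cohort_matches(years, gap=AGE_GAP):
    """Return pairs of waves separated by exactly the age gap."""
    year_set = set(years)
    return [(y, y + gap) for y in sorted(years) if y + gap in year_set]

for subject, years in WAVES.items():
    print(subject, cohort_matches(years))
```

For these four subjects the pairs found agree with the matches named in the text: reading (1970-71 with 1974-75, and 1979-80 with 1983-84), writing (1969-70 with 1973-74), and social studies (1971-72 with 1975-76), with none for citizenship.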

Analysis at the scale level. Trend analysis at the scale level is 
concerned primarily with how the distribution of scale scores for 
a given subject area changes over time. The issue of trends that 
are the same across all subpopulations versus those that differ in 
different subpopulations also arises as it did for individual exer- 
cises. An analytic tool that is appropriate for this type of analysis 
is the use of linear models to investigate the main effects of year 
and the year x subpopulation interactions. The use of linear
models is suited for studies of changes in the means of the dis-
tributions. Studies of changes in other features of the distribu-
tions of scores are more appropriately done using plots of the data
once the effects on the means have been isolated.
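A minimal sketch of the quantities such a linear model estimates is given below. It decomposes a subpopulation x year table of mean scale scores into a grand mean, main effects, and interaction residuals, using an unweighted-means decomposition chosen here for simplicity. The scores and the function name are hypothetical, and a full analysis would also supply standard errors that reflect the sample design.

```python
def decompose_means(means):
    """Split a subpopulation x year table of mean scale scores into
    grand mean, main effects, and interaction residuals
    (unweighted-means decomposition).
    """
    n_groups, n_years = len(means), len(means[0])
    grand = sum(sum(row) for row in means) / (n_groups * n_years)
    group_eff = [sum(row) / n_years - grand for row in means]
    year_eff = [sum(col) / n_groups - grand for col in zip(*means)]
    interaction = [
        [means[g][y] - grand - group_eff[g] - year_eff[y]
         for y in range(n_years)]
        for g in range(n_groups)
    ]
    return grand, group_eff, year_eff, interaction

# Hypothetical mean scale scores: two subpopulations, four waves.
means = [
    [250.0, 252.0, 251.0, 255.0],
    [230.0, 233.0, 236.0, 241.0],
]
grand, group_eff, year_eff, interaction = decompose_means(means)
print("year effects:", [round(v, 2) for v in year_eff])
print("interaction:", [[round(v, 2) for v in row] for row in interaction])
```

Year effects near zero indicate no overall trend in the means; interaction residuals near zero indicate that the subpopulations share the same trend.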

Reporting results. Once the significant results of the trend anal- 
yses are discovered, they will be summarized in simpler tables in
which the data are weighted appropriately to give population esti-
mates for the level of each variable (percentage or scale score)
across years and, if necessary, across the relevant subpopulations.



"Causal" or Path Analysis

If NAEP is conceived mainly as a data collection function with a
mission to develop and report population estimates of educa-
tional attainment for various groups over time and to codify the
data on public use tapes for others to analyze, the enterprise is
doomed to limited and sporadic impact. What is needed is a sus-
tained program of analyses that seek reasons for the various levels
of educational attainment and attempt to delineate their implica-
tions for policy alternatives. The availability of public use tapes
will stimulate some of this activity by investigators throughout
the country, but availability of data tapes alone will not sustain
it. Every effort should be made to buttress widespread use of the
data tapes because the ideological nature of education demands a
multiperspective examination. One way to accomplish this is to
maintain a continuing NAEP program of educational and policy
analysis that would provide timely perspectives on emergent and
recurrent issues and at the same time stimulate and facilitate
other investigators to elaborate, modify, and challenge NAEP find-
ings and interpretations.

This approach stresses analyses which focus on possible ex- 
planations of successful and unsuccessful performance. For ex- 
ample, that males outperform females in mathematics at a par- 
ticular age may be a fact, but its policy and action implications 
would differ depending on whether there are also large sex differ- 
ences in attitudes toward mathematics and in the number of 
mathematics courses taken. We do not contend that analyses of 
correlations based on nonexperimental survey data can answer 
questions of cause and effect, but such analyses can lead to rejec-
tion of some proposed explanations as inconsistent with the
existing data and may suggest hypotheses for future survey mea-
sures and for formal experimentation by others in different set-
tings. By this means, NAEP would not only report facts but relate
them to context and to policy alternatives.

Background and program variables. To relate NAEP achievement
data to issues of educational practice and policy requires addi-
tional information about the backgrounds of the students as-
sessed and about their experiences in schools and programs. Some
information of this type is already being collected by NAEP, but
the coverage of the student and school questionnaires needs to be
extended to allow us to address more fully the kinds of national
concerns, human resource needs, and program effectiveness
issues raised in Chapter I. Granted that questions to students and
principals cannot be expanded indefinitely, but they can be ex-
panded considerably beyond their current limits. Furthermore,
much school and community information can be assembled by
NAEP field personnel.

The variables to be tapped should be carefully chosen from a 
structured array of alternatives so that priority judgments are re-
quired and systematically justified (Messick & Barrows, 1972).
These variables may differ from subject area to subject area, from
age level to age level, and from assessment wave to assessment
wave, but a core set of key common variables should be retained.

The kinds of student and background variables to be considered 
include demographic descriptors; nonNAEP measures of academic
achievement; participation in special programs; measures of atti-
tudes, interests, aspirations, and plans; of time spent studying,
reading, viewing TV, in athletics and other activities, and (for
older students) in employment; and of a variety of family status
and process characteristics. The kinds of school and program vari-
ables to be considered include school descriptors for racial, ethnic,
and SES composition as well as desegregation history; size and
type of school and community; availability of special programs;
types of curricula, tracking arrangements, and extra-curricular
activities; resource utilization; and indicators of school climate
and image. 

In selecting specific variables, guidance would be sought from 
the educational literature but will be evaluated with great care. 
For example, measures of school facilities and curricula were 
only weakly related to verbal achievement in the Coleman (1966)
equal educational opportunity survey. But the measures reflected
neither the quality nor the utilization of the facilities and curri-
cula, yet they still appeared to have more impact for some types
of students than for others in more refined analyses (M. S. Smith,
1972). Furthermore, the educational achievement criteria in the
Coleman study were distinctly different from the subject-area ex-
ercises of NAEP.

Structural models and path analysis. Given that some amount
of information will be available about student background, home
and school environment, and program participation, structural
equation or path models of educational attainment can be formu-
lated and tested. Path analysis is a technique used to assess the
direct or so-called "causal" contribution of one variable to an-
other in nonexperimental data. The word "causal" is not meant
to imply any deep philosophical connotation beyond a shorthand
designation for an unobserved hypothesized process. The general
problem is that of estimating the parameters of a set of linear
structural equations representing the cause and effect relation-
ships hypothesized in a particular theoretical conception.
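The simplest case can be worked by hand. The sketch below assumes a three-variable recursive model in which X affects Y both directly and through an intervening variable M, with all variables standardized, and it derives the path coefficients from the three correlations. The variable labels and correlation values are hypothetical and are not drawn from NAEP data.

```python
def path_coefficients(r_xm, r_xy, r_my):
    """Standardized path coefficients for the recursive model
    X -> M, X -> Y, M -> Y, computed from three correlations.
    """
    p_mx = r_xm                                    # X -> M
    denom = 1.0 - r_xm ** 2
    p_yx = (r_xy - r_xm * r_my) / denom            # direct X -> Y
    p_ym = (r_my - r_xm * r_xy) / denom            # M -> Y
    return p_mx, p_yx, p_ym

# Hypothetical correlations: X = home background, M = attitude
# toward the subject, Y = achievement.
p_mx, p_yx, p_ym = path_coefficients(r_xm=0.40, r_xy=0.50, r_my=0.45)
direct = p_yx
indirect = p_mx * p_ym
print(round(direct, 3), round(indirect, 3), round(direct + indirect, 3))
```

In this model the direct and indirect effects of X sum to the observed correlation between X and Y, so the decomposition shows how much of that correlation the hypothesized intervening process would account for.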

Several recent path models incorporate unobserved latent con- 
structs or factors which, while not directly measured, have opera- 
tional implications for relationships among observed variables. In 
some models the observed variables are viewed as effects of the 
hypothesized constructs while in others they serve as causes, or
as both causes and effects, of the latent constructs (Joreskog &
Sorbom, 1979; Bentler, 1980). In effect, this approach combines
path analysis with factor analysis.

[Figure 7]

As an example of structural modeling, a hypothesized explana- 
tory model for student achievement is given in Figure 7. Individ- 
ual student performances in reading, writing, and citizenship/ 
social studies are hypothesized to be functions of a number of
other variables including demographic, attitudinal, expecta-
tional, aspirational, and peer dimensions as well as characteris-
tics of home and school environments and of school processes
and programs. All of these components combine to form a net-
work of specified interactions that affect educational perfor-
mances. Indeed, educational performance in turn may affect
some of its components such as aspirations and attitudes toward
oneself. Simply to report differences in performance for different
groups, while ignoring the available data for exploring this net-
work, leads the recipient of the results to engage in uninformed
speculations about their meaning and possible causes.

It is anticipated that explanatory models similar to Figure 7
will be formulated and tested both within and across population
groups. For example, it is possible that the size of the relative ef-
fects and the processes through which they act—that is, indirect
effects—may be different for different sex, race, ethnic, and age
groups. Comparisons between models for 9-year olds and 17-year
olds, as an instance, may suggest that school-related program
variables have a steadily increasing impact while parental vari-
ables decrease in influence during this transition. Cross-ethnic
and cross-sex group comparisons of similar models may be partic-
ularly informative with respect to how different programs and 
objectives affect such subgroups. 

Past analyses of educational performance have faltered on a
variety of technical problems. The traditional approach, as exem-
plified by the Coleman equal educational opportunity survey,
used a single equation model of educational attainment and
employed regression analysis to estimate the degree to which dif-
ferent components affected achievement. Such an approach has
no way to disentangle the correlations among predictor variables.
Since the order in which variables are entered into a regression
equation markedly affects the estimate of their importance and
since Coleman entered school variables last (thereby minimizing
the estimate of their effects), it is little wonder that he concluded
that schooling had little impact. Later investigators of the same
data using different analytic methods showed a substantial effect
of school variables (Mayeske, Wisler, Beaton, Weinfeld, Cohen,
Okada, Proshek, & Tabler, 1972).
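The order-of-entry effect can be seen with two correlated predictors. The sketch below uses the standard formula for the squared multiple correlation with two standardized predictors and compares the variance credited to the second predictor when it is entered first versus last. The correlations are invented for illustration and are not the Coleman survey values.

```python
def r_squared_two(r1y, r2y, r12):
    """Squared multiple correlation of Y on two standardized
    predictors, from their correlations."""
    return (r1y ** 2 + r2y ** 2 - 2 * r1y * r2y * r12) / (1 - r12 ** 2)

def credited_variance(r1y, r2y, r12):
    """Variance credited to predictor 2 when it is entered
    first versus last in a stepwise decomposition."""
    total = r_squared_two(r1y, r2y, r12)
    entered_first = r2y ** 2
    entered_last = total - r1y ** 2
    return entered_first, entered_last

# Hypothetical correlations: predictor 1 = home background,
# predictor 2 = school variables, Y = achievement.
first, last = credited_variance(r1y=0.5, r2y=0.4, r12=0.6)
print(round(first, 4), round(last, 4))
```

With these values the school variables are credited with 16 percent of the variance when entered first but under 2 percent when entered last, although the data are identical in both cases.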

Recent developments in path analysis and in the analysis of
structural equations (Joreskog & Sorbom, 1979; Bentler, 1980)
make it possible to specify much more realistic explanatory
models of educational performance and to avoid some of the tech-
nical problems of regression analysis. Basically, the network of
relationships in the explanatory model is represented by a set of
equations, and the data are used to estimate the unknown coeffi-
cients of the equations and the degree of confidence that can be
placed in the estimates. A very flexible computer program, LISREL V
(Joreskog & Sorbom, 1981), is available for the computations.

Several advantages of using structural equations should be 
noted. Parameters for the entire model are estimated simultane-
ously, thus avoiding the bias involved in estimating the equa-
tions separately by regression analysis. Reciprocal relationships
may be introduced, such as the effect of performance on attitudes
as well as the effect of attitudes on performance. The explanatory
variables in the model need not be considered to be measured
without error, as in regression analysis. Furthermore, the errors
in the variables may be assumed to be correlated. When two or
more variables are combined into a composite, a reliability is
computed, reported, and used in the estimation procedure.



Special Studies 

Inevitably, a number of special concerns arise over the years that
NAEP cannot readily address within its regular financial resources
but that would be beneficially addressed within the NAEP environ-
ment. This is because the special studies, if done in the NAEP con-
text, might be tailored to benefit NAEP functions or broaden its
purposes, while at the same time the study in question capi-
talizes on existing facilities or ongoing activities. For these rea- 
sons, NAEP should be committed to a continuing effort to develop 
funding for such additional studies from private foundations or 
appropriate government agencies. The following kinds of studies 
should be high on the agenda.

Assessment of
Functionally-Handicapped Students

In a recent report of the National Academy of Sciences (Heller,
Holtzman, & Messick, 1982), the educational progress of edu-
cable mentally retarded and other functionally-handicapped stu-
dents was singled out as the touchstone for equity in special edu-
cation. Since such students are currently excluded from NAEP, it
seems fitting that NAEP attempt to mount a special assessment of
their educational competencies and ultimately of their educa-
tional progress. Indeed, the effort would be facilitated if such
students were not only identified for exclusion in the NAEP sampl-
ing process, but were described in more detail in regard to their
background and program experiences, as proposed in this NAEP re-
design.

Assessment of the competencies of handicapped students faces 
a number of major roadblocks because of fundamental problems
in exercise development, administration, and interpretation that
are encountered (Bennett, in press). The Education for All Handi-
capped Children Act (PL 94-142) requires that educational goals
for handicapped students be individually prescribed. From the
standpoint of assessment, this requirement results in the creation
of an unmanageably large array of goals from which common ob-
jectives for exercise development may not be easily extracted
(Maher & Bennett, in press). The diverse needs of handicapped
students also demand departures from traditional exercise for-
mats; exercises ordinarily printed in standard form must typically
be created in braille, cassette, and large-type versions. Adminis-
tration is made difficult because many disabled students require
untimed individual administrations. Special probes and monitor-
ing may also be required to assure that the instructions are under-
stood.

Such departures from standardized conditions, as well as the re-
quired variations in exercise format, in turn create dilemmas for
data interpretation. Aggregation of data is at best problematic and
at worst pointless unless individual assessments can be placed on
a common scale—or unless some kind of defensible basis for com-
parability can be realized. However, if comparable assessments
can be achieved for students with the same type of handicap, then
their educational progress could be monitored even though it
would not be strictly comparable to the progress of other handi-
capped or nonhandicapped groups.

These difficulties were recited not to justify the continued exclusion of this
important segment of the school population from NAEP,
but to underscore the nature of the challenge to measurement
specialists and to make it clear why this effort should be a series
of special studies rather than an integral part of NAEP. To begin
with, the problems of assessing the mentally and physically han-
dicapped require concentrated attention
that should capitalize upon the NAEP field presence in schools but
should not disrupt that presence or the regular NAEP activities.
Ultimately, if these assessment problems can be solved, the edu-
cational progress of functionally-handicapped students might
become an integral part of the national assessment.

Assessment of
Limited-English Speaking Students

Since the other major group of students excluded from NAEP—the
non-English proficient—typically come from ethnic minority
groups, their continued exclusion may seriously bias interpreta-
tions of the educational progress of those ethnic groups. Further-
more, as with special education for the handicapped, the touch-
stone for equity in bilingual education is the educational progress
of the students. For these reasons, NAEP should mount a special
study attacking the measurement and logistical problems in
assessing non-English proficient groups.

These problems are no less formidable than those of assessing 
the handicapped. First, exercises must be developed in a number
of different languages—although this might be addressed in
waves of one language at a time, beginning with Spanish because
of the size of the Hispanic minority in the country. Aside from
the substantial resources required to accomplish this, differences
among languages make it difficult to develop non-English exer-
cises that are precisely comparable to English-language versions.
Second, non-English proficient students often vary in their knowl-
edge of the written form of their language. Even though they may
speak that language better than they speak English, they may not
read that language well enough to be examined in it via printed
exercises. This underscores the point that one of the goals of
assessing limited-English speaking students should be assess-
ment of their proficiency in both English and their native lan-
guage. Finally, inclusion of students from backgrounds providing
little preparation for formal examinations necessitates using spe-
cially-trained examiners to assist students in understanding the
requirements of the examination situation.

Again, this litany of troublesome problems is not meant to
justify continued exclusion of non-English proficient students
from NAEP, but to highlight the need for a special frontal attack on
an important national issue in educational assessment.

Innovative Exercise Development 

Although attention to innovative exercise development should be
a routine part of NAEP's day-to-day activities, the focus in that
context tends to be on the development of new ways—that are
more valid or efficient or interesting—to measure dimensions
already being measured in old ways. In contrast, this proposed
special study focusses as well on the development of new ways of
using old methods to assess new dimensions and, most impor-
tantly, on new ways of assessing new dimensions that have previ-
ously been difficult to capture. It is proposed as a special study
because a critical mass of attention and effort is needed at the out-
set, although the innovations developed and the innovative mode
of development should ultimately be incorporated as standard
NAEP approaches.

As an example of using old methods in new ways to measure
new dimensions, consider the possibility of using integrated sets
of multiple-choice items to assess complex problem-solving or
decision-making processes in a subject area. Since each step in
complex problem solving entails a decision point or a set of deci-
sion points, multiple-choice items could be constructed to assess
the choices made—for example, the kinds of information sought,
the strategies utilized, the hypotheses generated, the analyses 
undertaken, the alternatives weighed, the solutions selected, and 
so forth, perhaps each with an associated item that requires selec- 
tion of the reason for each move. The multiple-choice formats 
would be broadly conceived to include matching and keylist pro- 
cedures, for example, as well as more standard versions. Such in-
tegrated sets of exercises could also be branched depending upon
the choices made at each point, with or without provision for
recycling.

As another instance, if multiple-choice exercises were con- 
structed so that selection of incorrect distractors were indicative
of common errors made during learning, then patterns of distrac-
tor choice might be diagnostic of instructional problem areas.
With such exercises, reports of average percent correct could be
accompanied by summaries of the types and frequencies of errors
made, thereby enriching the utility of the results for instructional
purposes at the classroom level.
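A report of this kind amounts to a tally of distractor choices keyed to error types. The sketch below shows one way such a summary might be produced; the exercise, the answer key, the error labels, and the responses are all invented for illustration.

```python
from collections import Counter

# Hypothetical exercise: 0.5 + 0.25, with each distractor keyed
# to the error it is assumed to reflect.
KEY = "C"
ERROR_TYPES = {
    "A": "added digits, ignored decimal point",
    "B": "misaligned place values",
    "D": "multiplied instead of added",
}

def summarize(responses):
    """Return percent correct and error-type frequencies."""
    n = len(responses)
    pct_correct = 100.0 * sum(r == KEY for r in responses) / n
    errors = Counter(ERROR_TYPES[r] for r in responses if r in ERROR_TYPES)
    return pct_correct, errors

responses = ["C", "A", "C", "B", "A", "C", "D", "C", "A", "C"]
pct, errors = summarize(responses)
print(pct)
for error, count in errors.most_common():
    print(count, error)
```

The percent correct is reported as usual, and the error frequencies accompany it as the diagnostic supplement.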

Both of these examples illustrate a means of overcoming one of
the major criticisms of multiple-choice exercises—namely, their
rigidity of application and orientation to outcome rather than pro-
cess. At the same time the new uses retain the major advantages
of multiple-choice methods—namely, the economy, efficiency,
and ease of administration and scoring that historically have 
tipped the scale in favor of their use over other types of exercises. 

An example of new ways of assessing new dimensions that 
have been elusive in the past is the use of problem simulations, 
which might be presented by printed material or by film or video- 
tape techniques. Students might be asked to generate as many 
alternative hypotheses as they can for a given problem, for exam-
ple, or as many alternative reasons as they can for a given out-
come. Such productive responses could then be judgmentally
scored for fluency, flexibility, and originality or other aspects of
divergent thinking (e.g., Frederiksen & Evans, 1974; Ward, 1982;
Ward, Frederiksen, & Carlson, 1980). The simulations could also
be constructed to assess sensitivity to problems or problem-find-
ing skills in various subject areas.

With videotape technology, simulated interpersonal scenes
could be presented and periodically interrupted with questions or
tasks to assess sensitivity to interpersonal cues, appreciation or
tolerance of individual and group differences, and a variety of
other social skills (e.g., Stricker, 1982). In addition, videotape
presentation could facilitate assessment of understanding and
appreciation of the performing arts. Finally, computer technology
offers another powerful vehicle for innovative exercise develop- 
ment which will be briefly discussed below. 



Computer-Assisted Assessment

Available computer technology can improve the efficiency of a 
number of NAEP activities almost immediately—such as the use of
computer networks for remote conferencing, which would facili-
tate committee work on such activities as objective setting and
exercise review while reducing the number of face-to-face meet-
ings required. Another instance is remote access to NAEP data
bases for special analyses or inquiries by the various NAEP com-
mittees or by NIE. If such capabilities have not yet been intro-
duced, they should be explored in the near future. However,
direct contributions of computer technology to the main NAEP
activity of assessment require special study. Such a special study
or set of studies should not only address the feasibility and ap-
propriate timing of introducing computer-assisted assessment
into NAEP, but should attempt to develop the technical means 
for optimizing computer use in exercise development and admin- 
istration. 

One set of issues involves the use of the computer for exercise
administration—such as to insure proper spiralling of exercises
within and across subject areas during individual administrations
or to obtain efficient assessments of individual skill levels via
tailored testing procedures (Lord, 1977, 1980b), or some com-
bination of both. Another set of issues involves the development
of measurement procedures and innovative exercises that capital-
ize on the algorithmic and heuristic capabilities of the computer
to improve the assessment of existing and new skill dimensions.
For example, with computer administration, latency and speed
measures could be routinely obtained which might prove of value
in the assessment of mastery in reading, computation, and other
performance skills; such measures applied to knowledge retrieval
exercises should also buttress the assessment of subject mastery.
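The core step of a tailored test is choosing the next exercise given the current skill estimate. The sketch below uses maximum-information selection under a two-parameter logistic model, which is one common rule and not necessarily the procedure intended in the works cited above. The exercise bank, its parameters, and the function names are hypothetical.

```python
import math

def information(theta, a, b):
    """Fisher information of a 2PL item at skill level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_exercise(theta, bank, used):
    """Pick the unused exercise that is most informative at theta.

    bank -- dict mapping exercise id to (a, b) parameters
    used -- set of exercise ids already administered
    """
    candidates = [k for k in bank if k not in used]
    return max(candidates, key=lambda k: information(theta, *bank[k]))

# Hypothetical calibrated bank with equal discriminations.
bank = {"e1": (1.0, -2.0), "e2": (1.0, -0.5), "e3": (1.0, 0.4), "e4": (1.0, 1.8)}
print(next_exercise(0.5, bank, used=set()))
print(next_exercise(0.5, bank, used={"e3"}))
```

With equal discriminations the rule reduces to picking the unused exercise whose difficulty lies closest to the current skill estimate, which is why fewer exercises are needed per respondent than with a fixed form.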

In regard to new skill dimensions not well covered previously,
the computer makes possible the assessment of information pro-
cessing skills that are difficult to assess by other means—such as
skills involved in information search and organization, hypothe-
sis generation and testing, restructuring of information, and other
components of complex problem-solving and decision-making
tasks or other types of sequential thinking. This is possible
because the computer can record the paths, speed, and outcomes
of such activities as they occur on subtasks within the sequence
—in contrast to the limited and schematized attempts discussed
earlier to mimic this process with integrated sets of multiple-
choice exercises.

Special studies were highlighted in this NAEP redesign because
some ongoing capability to probe and explore important assess-
ment and development opportunities is needed as a basis for
NAEP's continuous improvement and renewal.

Enhancing NAEP's Flexibility
To Meet Varied Assessment Needs



The proposed NAEP redesign affords vast flexibility in data analy-
sis and in relating data to a variety of policy issues. But sophis-
ticated analysis is not enough—in addition, NAEP needs sophis-
ticated ways of communicating the results and of targeting the
presentations to the needs of various audiences. Furthermore,
NAEP's capacity to meet a variety of assessment needs would be
markedly enhanced by linking NAEP data to other national, state,
and local data sources and by extending refined NAEP services to a
broader clientele. Finally, since the objective-setting process is
just one step removed from the standard-setting process and since
NAEP results bear directly on attained performance levels, NAEP
should actively confront the issue of educational standards—not
to set them, but to clarify them and to help the various interested
publics to set their own standards. Each of these points is briefly
discussed in turn in the ensuing sections.



Flexibility in Analysis and Reporting 

We have seen how the availability of covariances among exercises
as well as the availability of scales having common meaning
across population subgroups, age levels, and time periods serves
to improve the meaningfulness and interpretability of assessment
results and trends. These are among the most important of the
benefits deriving from BIB spiralling and IRT scaling, but they are
by no means the only important benefits. We next review how IRT
scaling provides great flexibility in relating achievement data to
policy questions. We then review methods for flexibly presenting
achievement data so that its meaning and import are readily
revealed in a particular policy context or to constituencies with
particular concerns.

Responding to Multiple Policy Issues



An important by-product of the IRT scaling of NAEP exercises is
that estimates are available of each respondent's skill levels for
those areas in which he or she was assessed. This means that the
various achievement dimensions scaled by IRT may be correlated
with any of the variables of background, attitude, school, and
program that are tied to those individuals via the student and school
questionnaires, school records, or other means.

Furthermore, these variables could also be used to generate
group comparisons, such as students in college preparatory versus
vocational programs, students in private versus public
schools, or students exposed to preschool programs versus those
who were not. Although the resulting sample sizes in many of
these group comparisons will not be large or nationally
representative, they may be sufficient to provide timely provisional
answers pending more intensive investigation. Given the availability
of other background variables characterizing the groups in
question, these group comparisons may also be conducted
controlling for a variety of home, school, and demographic factors by
means of analysis of covariance techniques. Although student
skill estimates are not reliable enough for reporting at the
individual level, they are sufficiently reliable for comparisons at the
group level as well as for correlational work, where in any event
unreliability can be taken into account.
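
The covariance adjustment mentioned above can be sketched in a few
lines. This is a minimal illustration only, not NAEP's procedure: it
assumes two groups, a single background covariate, and invented
numbers, and the name adjusted_group_difference is hypothetical. It
uses the classical one-covariate form, in which the raw difference in
group means is corrected by the pooled within-group slope times the
difference in covariate means.

```python
from statistics import mean


def adjusted_group_difference(group_a, group_b):
    """Return (raw, adjusted, slope) for group A minus group B.

    Each group is a list of (covariate, skill_estimate) pairs. The slope
    is the pooled within-group regression slope of skill on the covariate,
    as in a classical one-covariate analysis of covariance.
    """
    def centered_sums(pairs):
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        x_bar, y_bar = mean(xs), mean(ys)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        return x_bar, y_bar, sxy, sxx

    xa, ya, sxy_a, sxx_a = centered_sums(group_a)
    xb, yb, sxy_b, sxx_b = centered_sums(group_b)
    slope = (sxy_a + sxy_b) / (sxx_a + sxx_b)  # pooled within-group slope
    raw = ya - yb
    adjusted = raw - slope * (xa - xb)  # remove the covariate imbalance
    return raw, adjusted, slope


# Invented data: skill = 2.0 + 0.5 * covariate + 3.0 for group A only.
group_b = [(1, 2.5), (2, 3.0), (3, 3.5), (4, 4.0)]
group_a = [(3, 6.5), (4, 7.0), (5, 7.5), (6, 8.0)]
raw, adjusted, slope = adjusted_group_difference(group_a, group_b)
print(raw, adjusted, slope)
```

In this invented case the groups differ on the covariate as well as on
skill, so the raw difference overstates the group effect and the
adjusted difference recovers the value built into the data.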

The only limitation on the nature and number of educational
and policy questions that can be addressed in this fashion is
whether or not relevant background and program variables were
included in the student and school questionnaires or are derivable
from other sources. The capacity to respond to new policy questions
with existing data thus depends on our luck or our wit in
having included variables pertinent to the questions.

Communicating Results to Multiple Audiences 

The most effective way to communicate complex statistical
results is with graphical formats (Wainer & Thissen, 1981).
Paradoxically, one rather compelling bit of evidence supporting this is
the often poor quality of published graphics. The continued
existence of poor graphics is partially due to the amazing capacity of a
human audience to be able to understand graphs accurately and
quickly even though they contain serious logical or technical
faults. This helps to explain why so many of the empirical
investigations into the efficacy of various graphical formats have
shown variable results and small differences in efficacy among
the alternative forms of graphs (MacDonald-Ross, 1977; Wainer,
Groves, & Lono, 1978, 1979). This tends to be true for large
effects in simple data structures, however, where any reasonable
display will work. When the effects are subtle or the data are
complex, the displays must be done wisely.

Graphics both clarify and reveal relationships. To illustrate
how a good display can provide still more information after it is
redesigned, consider the data in Table 4, which originally
appeared as Table 10 in the 1981 NAEP Report Number 11-R-01. The
table presents all the information required to see certain effects,
most notably the increase in performance of the lowest achievement
class of nine-year olds. We note that the data given in Table
4 are distributional, providing achievement summaries for the
various ability levels in each of three birth cohorts. To show these
distributions more clearly still, we can utilize a variant of a
box-and-whisker plot (Tukey, 1977). We will use a dot to represent
performance in the extreme ability groups, and horizontal lines to
represent performance in the middle groups, with a separate
line to represent the national mean. These horizontal lines will
then be connected to form boxes which enclose approximately
the middle 50% of the students. Such a plot is shown in Figure 8.
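
The construction just described can be made concrete with a small
sketch. It is illustrative only: the function names are hypothetical,
the strip is drawn horizontally as text rather than vertically as in
Figure 8, and it assumes that the box edges are the means of the two
middle achievement classes. The values are the Age 17 figures for
1971 from Table 4.

```python
def box_plot_elements(class_means, national_mean):
    """Map four achievement-class means to the elements of the plot."""
    lowest, lower_mid, upper_mid, highest = sorted(class_means)
    return {
        "dots": (lowest, highest),      # extreme achievement classes
        "box": (lower_mid, upper_mid),  # middle classes, joined into a box
        "mean_line": national_mean,     # national mean
    }


def render_strip(elements, lo=30, hi=90):
    """Draw one box as a text strip, one character per percentage point."""
    cells = [" "] * (hi - lo + 1)

    def col(value):
        return int(value - lo + 0.5)  # nearest whole percentage point

    left, right = (col(v) for v in elements["box"])
    for i in range(left, right + 1):
        cells[i] = "="
    cells[left], cells[right] = "[", "]"
    cells[col(elements["mean_line"])] = "|"
    for value in elements["dots"]:
        cells[col(value)] = "o"
    return "".join(cells)


# Age 17, 1971 assessment (Table 4): classes 1 to 4 and the nation.
elements = box_plot_elements([39.1, 58.7, 72.3, 86.8], national_mean=64.2)
strip = render_strip(elements)
print(strip)
```

Stacking one such strip per assessment year and age group gives a
rough text analogue of the display, with the distance between each dot
and the nearest box edge showing how far the extreme groups sit from
the middle of the distribution.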

The display in Figure 8 forces us to see what we had to look
closely for in Table 4. Specifically, we note that among the
nine-year olds the lowest achievement group is further from the rest
than appears to be the case in the other age groups. Also, an
increase in performance in 1981 (the 1971 birth cohort) is evident,
especially in the lowest achievement group. An interesting facet
of these data revealed in this plot is that the 1962-63 birth cohort
seems to perform more poorly than the other birth cohorts. This
is seen in the 9-year old data (where those 9-year olds born later
do better) and again in the 17-year old data (where those 17-year
olds born earlier do better). Thus, we begin to see some
longitudinal characteristics from these cross-sectional data. Our ability to
observe these interesting effects is partially due to the display
methodology. Note that notched box plots (McGill, Tukey, &
Larson, 1978) could also be used to provide visual information on
the statistical significance of observed visual differences.


Table 4

National Results by Achievement Classes: Mean Percentages
and Changes in Correct Responses for Ages 9, 13 and In-School 17
on Inferential Comprehension Exercises in Three Reading Assessments†

Age 9: 27 Exercises
                                Change            Change            Change
                        1971    1971-75    1975   1975-80    1980   1971-80
Nation                  60.5%     0.9      61.4%    2.5*     63.9%    3.5*
Achievement class 1     35.5      3.3*     38.8     4.7*     43.4     7.9*
Achievement class 2     57.8      1.5      59.3     2.2*     61.6     3.7*
Achievement class 3     68.5     -0.1      68.4     1.4      69.8     1.2
Achievement class 4     80.0     -0.8      79.2     1.8      81.0     1.0

Age 13: 24 Exercises
                                Change            Change            Change
                        1970    1970-74    1974   1974-79    1979   1970-79
Nation                  56.1%    -0.8      55.3%    0.2      55.5%   -0.6
Achievement class 1     35.0      1.2      36.2     0.5      36.7     1.7
Achievement class 2     50.8      0.1      50.9     0.9      51.8     1.0
Achievement class 3     61.8     -1.3      60.6    -0.4      60.2    -1.7
Achievement class 4     76.6     -3.1*     73.5    -0.4      73.1    -3.4*

Age 17: 25 Exercises
                                Change            Change            Change
                        1971    1971-75    1975   1975-80    1980   1971-80
Nation                  64.2%    -0.9      63.3%   -1.2      62.1%   -2.1*
Achievement class 1     39.1      2.5*     41.6    -1.4      40.1     1.0
Achievement class 2     58.7     -0.1      58.6    -2.0      56.7    -2.0
Achievement class 3     72.3     -2.6*     69.7    -1.2      68.4    -3.9*
Achievement class 4     86.8     -3.4*     83.5    -0.3      83.2    -3.7*

† Figures may not total due to rounding.
* Indicates significant change in performance between assessments.
Note: Achievement class 1 = lowest one-fourth
      Achievement class 2 = middle lowest one-fourth
      Achievement class 3 = middle highest one-fourth
      Achievement class 4 = highest one-fourth

Cohort effects are seen only by contrast with these data because
the dependent variable (percent correct) cannot be compared
across age levels; that is, 46 percent correct in the assessment of
9-year olds does not compare to 46 percent correct in the
assessment of 13-year olds. Yet, if the exercises were linked or equated
in some way, we would be able to make these kinds of comparisons.
Using the IRT scaling methodology espoused in this proposed
NAEP redesign would yield an underlying skill scale on
which all groups could be directly compared. A plot of how such
hypothetical data might appear is given in Figure 9.


Figure 8

National Results by Achievement Classes: Mean Percentages
and Changes in Correct Responses for Ages 9, 13, and In-School 17
on Inferential Comprehension Exercises in Three Reading Assessments


Figure 9

Hypothetical Example Showing Longitudinal Trends Within Cohort

[Plot not reproduced. Vertical axis: Low to High. Horizontal axis:
Year of Data Collection, '70 to '90.]



The shift in emphasis from Figure 8 to this plot is the connecting
of birth cohorts over time. The slope of these connecting
lines provides a measure of the rate of educational growth. The
location of the points provides a measure of the change seen
across cohorts. In Figure 9 we see increases in skill from the 1963
cohort to the 1967 cohort to the 1971 cohort. If such a finding did
occur we might then look to exogenous variables to provide
explanatory clues for the upward migration, such as better instruction,
increased emphasis on basics, or newer teaching techniques.
If desired, one could use box plots rather than points to
provide a fuller picture of change in the entire distributions of
skill.
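
The two readings of such a plot, slope within a cohort and level
across cohorts, can be sketched as follows. The numbers are invented
scale scores on an assumed common skill scale, the assessment years
are chosen only so that the cohorts line up, and the function names
are hypothetical. The slope is computed from the first and last
points of each cohort.

```python
from collections import defaultdict


def cohort_growth(records):
    """Skill gained per year within each birth cohort (first to last point)."""
    by_cohort = defaultdict(list)
    for year, age, skill in records:
        by_cohort[year - age].append((year, skill))  # birth cohort = year - age
    growth = {}
    for cohort, points in sorted(by_cohort.items()):
        points.sort()
        if len(points) < 2:
            growth[cohort] = None  # a single point has no slope
        else:
            (y0, s0), (y1, s1) = points[0], points[-1]
            growth[cohort] = (s1 - s0) / (y1 - y0)
    return growth


def level_by_cohort(records, age):
    """Skill at a fixed age for successive cohorts (change across cohorts)."""
    return sorted((year - a, skill) for year, a, skill in records if a == age)


# Invented (year of data collection, age, mean skill) triples.
records = [
    (1972, 9, 200.0), (1976, 13, 240.0), (1980, 17, 280.0),  # 1963 cohort
    (1976, 9, 210.0), (1980, 13, 252.0),                     # 1967 cohort
    (1980, 9, 220.0),                                        # 1971 cohort
]
growth = cohort_growth(records)
age9 = level_by_cohort(records, age=9)
print(growth)
print(age9)
```

The first result corresponds to the slopes of the connecting lines and
the second to the vertical position of the age 9 points for successive
cohorts.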

It should be clear from this example that explaining complex
data structures in prose or in tables provides neither the ease of
comprehension, nor the richness of interpretation, that is
available in even these straightforward plots. More complex data
require still more imaginative plotting techniques. For example,
how would one show the same results as in Figure 9 broken down
by geographic region?

Proposed graphical reporting system. With these illustrations
in mind, we can better discuss the proposed approach to the
reporting of results. This approach is principally graphic in
orientation, flexible in design, and takes advantage of the latest
computer technology for the plotting and dissemination of findings.

Since NAEP results are of interest to a diversity of audiences,
trying to provide a single report or reporting mode that would satisfy
all interests is doomed from the start. Trying to anticipate the
various audiences and providing parallel documents at each of
these levels has a greater possibility of success, but it is a very
difficult and cumbersome chore. What seems a more fruitful
approach is to provide information at a variety of levels for what are
clearly the major audiences, yet simultaneously have the
capability for quick and easy generation of graphical and tabular
answers to questions asked on an ad hoc basis. Thus, one might
have a general answer pre-prepared for salient questions (e.g.,
What is the mean reading ability of the 1961 birth cohort from age
9 until age 17?) and allow specialized answers to be generated on
demand (e.g., the same question, but just for rural schools).

What is needed is an interactive dynamic system of graphical
data analysis to provide the capacity both to make quick responses
and to ask questions suggested by the answers to previous
questions. This system should provide both static and kinematic
display capabilities. For communication in the traditional print
media static displays continue to provide an accurate and
efficient method. Wisely chosen graphs can often deliver quite
complex messages. We expect that this will continue to be the
principal mode of information dissemination using computer graphic
hardware and software linked to the appropriately structured
NAEP data base. Recent developments in computer technology and
software design also facilitate the routine use of kinematic
displays, which make possible compelling and informative data
presentations via film or TV media.

Kinematic displays have a number of overlapping uses. First, an
interactive kinematic display provides an easy way for an
investigator to explore both the gross and fine structure of complex data,
by panning around the data structure noting regularities and then
zooming in on irregularities and outliers. Using such techniques
one can spot an unusual data configuration, zoom in on it and
immediately bring to bear exogenous program or background
variables to try to understand plausible causes for the atypical
behavior. Second, complex multivariate data structures are often best
seen in a kinematic display. The precise sort of display depends
on the data. For example, with three-dimensional data, one can
produce an evocative three-dimensional image by rotating the
three-dimensional scatter plot in real time. Even though the
display is on a flat screen, the image perceived is three dimensional.

Another kind of kinematic display, which is quite useful for
viewing and comparing a series of two-dimensional figures, is the
alternagraphic plot. This method alternates two or more plots
which are to be compared quickly enough so that the eye
superimposes one on the other, yet slowly enough so that the separate
displays can be seen as well (about 500 milliseconds each).

As a quick illustration of how NAEP might use some of the
simpler aspects of this kinematic display technology, consider some
variations on Figure 9. Suppose we were interested in comparing
the data shown in that figure with the same data for a specific
subpopulation such as an ethnic, sex, or regional breakdown. We
could use an alternagraphic display, alternating back and forth
between the data in Figure 9 and the data for the subpopulation. A
short viewing time would provide a clear picture. This method
could be expanded to more than two plots.

While kinematic displays provide a powerful data-analytic tool
for investigators, the main intention here is for the
communication of results to a broad audience. The vast majority of the U.S.
population get most of their information about the outside world
through the video media. Thus, in order to communicate facts
and understanding about complex data structures to the public, it
would be a matter of small difficulty to prepare video tapes using
kinematic display technology. The possibilities opened up by
such a capacity are both broad and exciting; the time is certainly
ripe for their exploration and use.



Extending NAEP's Impact 

The impact of NAEP results could be both extended and enriched
by linking NAEP data to that in other data bases and by linking the
national assessment program to other assessment programs.

Linking to Other Data Bases

The power and value of past and future NAEP data would be
tremendously enhanced if the responses of NAEP samples could be
directly compared to, or interpreted in the light of, the responses of
different samples to the same or demonstrably equivalent
exercise materials. For example, the use of NAEP exercises in other
national surveys could provide trend data not otherwise available
given the spacing between assessments. Furthermore, the use of
NAEP exercises in samples with a different design (perhaps a
national sample in which multiple minority groups have been
systematically oversampled) would permit more intensive
investigation of differential performance correlates than is possible with
NAEP data alone. In addition, there is also the possibility of linking
NAEP findings to data bases in which more comprehensive
descriptors of the respondents are available. These data might include
extensive student variables (cognitive and noncognitive),
background factors (ethnic, parental), or situational characteristics
(school, community, labor market).

These linkages could come about in three major ways: (1) by
use of NAEP exercises in other surveys where the data collection
procedures were sufficiently similar to permit comparisons; (2)
by equating NAEP exercises to similar measures in other
assessments and surveys; and (3) by embedding NAEP exercises in the
instrumentation for future assessments and surveys. Each of
these possibilities is briefly discussed in the ensuing paragraphs.

NAEP exercises in other surveys. Since NAEP exercises were
developed with great care and the associated response data provide a
national perspective, it would be beneficial to use released NAEP
exercises in other surveys. Indeed, this was done in the 1980 data
collections of High School and Beyond (HSB), the name given to
the new high-school cohorts surveyed in the spring of 1980 in the
national longitudinal studies sponsored by the National Center
for Education Statistics.

In 1978, the test battery for High School and Beyond was
designed by Educational Testing Service. Wishing to include in the
battery a set of exercises measuring science knowledge, ETS
recommended that NAEP exercises be used in order to fulfill several
objectives, one of which was the establishment of a link between
NAEP and HSB. The NAEP science exercises were included in the
1980 sophomore battery. Then, in 1982, the original sophomores
were given exactly the same science exercises again, at which
time most of the students were seniors.

Because of differences in the mode of administration, NAEP and
HSB data on these same science exercises differ in a number of
ways. In NAEP the exercises were group administered with
tape-recorded instructions and generous time limits, whereas in HSB the
instructions were read and explained by a survey administrator
and there was a 10-minute time limit for 20 science questions.
Moreover, the NAEP exercises had six options including "I don't
know", whereas in HSB the "I don't know" option was omitted.
Also, it should be kept in mind that the NAEP cohorts were
selected by age, whereas the HSB respondents were grouped by high
school grade level.

Thus, even when the respondents are comparable as far as
educational development is concerned, there are some possibly
serious constraints on what can be concluded from comparisons
between the performance of NAEP and HSB samples. But there may
also be some useful comparative findings as follows:

(1) Since the HSB respondents who first took the NAEP science
exercises as sophomores later took the same exercises as seniors,
the HSB results provide some useful data as to which exercises are
the most sensitive measures of growth in science knowledge from
the sophomore to the senior year. Also, since the HSB data are
certain to be used in studies of school effects, there should be
information on the correlation between the science exercises and
school variables.

(2) The HSB data file has a much broader range of information
on the characteristics of individual students and on the schools
they attended than does the NAEP file. The HSB file thus provides a
more comprehensive picture of the characteristics of students
who were successful on the NAEP science exercises in comparison
with those who were not successful.

(3) Since scores on the Scholastic Aptitude Test and on the
Armed Services Vocational Aptitude Battery are being retrieved
for HSB students, it would be possible to link performance on the
NAEP science exercises to performance on the SAT and ASVAB.

(4) As part of an evaluation of the HSB battery, the sophomore
data were factor analyzed, with the results reported in Table 5
(Heyns & Hilton, 1982). These results suggest that the NAEP
science exercises, as administered under HSB conditions, reflect a set
of fairly broad cognitive abilities, as witness the science loading
of .61 on a verbal factor and .21 on a math factor, with some
variance left over reflective of science information.
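
One way to see how two such loadings relate to the share of variance
accounted for is the standard communality formula for two correlated
factors. The sketch below is an interpretation, not something the
source states: it assumes the Table 5 loadings are pattern loadings on
correlated factors, uses the factor correlation of .841 reported in
that table, and treats a blank loading as zero. The function name
communality is hypothetical.

```python
def communality(verbal, math_loading, factor_corr):
    """Share of a test's variance explained by two correlated factors.

    Assumes pattern loadings: h^2 = a^2 + b^2 + 2 * a * b * r.
    """
    return verbal ** 2 + math_loading ** 2 + 2 * verbal * math_loading * factor_corr


# Science loadings quoted in the text; factor correlation from Table 5.
science = communality(0.61, 0.21, 0.841)
# Reading shows a verbal loading only; the blank is treated as zero here.
reading = communality(0.86, 0.0, 0.841)
print(round(100 * science, 1), round(100 * reading, 1))
```

Under these assumptions the science figure comes out close to, though
not exactly at, the 64 percent shown in Table 5, a gap that would be
expected if the published loadings were rounded to two decimals.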

Table 5

Two Factor Solution for "High School and Beyond" Sophomores
and Reliabilities (N = 26,110)

                       Factor loadings        Percentage
                                              Variance
                    Verbal    Mathematics     Accounted For    KR 20
Vocabulary           .83                          68            .81
Reading              .86                          74            .78
Mathematics I                    .94              88            .85
Mathematics II                   .72              52            .54
Science              .61         .21              64            .75
Writing              .61         .18              60            .80
Civics               .69        -.01              45            .53

Correlation Between Factors

                   V        Q
Verbal           1.00     0.841
Quantitative     0.841    1.00


Other linkages to existing data files are possible and may
provide valuable insights. Approximately 25 states have used NAEP
exercises in various numbers and in various ways (usually in large
group administrations). As with the HSB data, the state data could
be particularly valuable where relatively large numbers of special
populations were tested, a possible example being Native
Americans in certain western states. As a final example of the inclusion
of NAEP exercises in other surveys, we mention the possibility of
foreign administrations, which would provide the opportunity for
an international perspective on educational achievement. These
could be programmatic cross-national surveys, such as the
international studies of comparative educational achievement
conducted by the International Education Assessment, or
cooperative arrangements for the exchange of exercises with the national
surveys of other countries.

Equating NAEP exercises to other existing measures. Where the
interest is in linking NAEP exercises to similar but not identical
exercises already used in other surveys, it may be possible to
equate the two sets of exercises by means of specially designed
equating experiments. As an instance, Beaton, Hilton, and
Schrader (1977) equated similar exercises from two different data
sets as part of a study of the SAT score decline.

Some examples of relevant data bases that might be linked to
NAEP via equating are Project Talent, the Coleman Equal
Educational Opportunity Survey, the ETS Study of Academic Prediction
and Growth, the NCES National Longitudinal Studies, the ETS-Head
Start Longitudinal Study of Disadvantaged Children, and the
Department of Labor National Longitudinal Survey.

Embedding NAEP exercises in future surveys. What is of
considerably more promise is the possibility of embedding NAEP
exercises in future surveys and in educational achievement tests
developed by commercial publishers. On this latter score, a NAEP
service offering commercial publishers an opportunity to obtain
nationally-normed exercises would both upgrade the quality of
educational testing generally and provide much needed revenue
to NAEP for underwriting other activities. As a consequence, since
commercially published educational tests are widely used in state
and local assessments, the inclusion of NAEP-normed exercises
embedded within them would both link these assessments to
NAEP for purposes of research and provide a current national
perspective for interpreting the state or local findings.

Extending NAEP Assessment Services 

The ultimate value of NAEP must be viewed in terms of its
contributions to a variety of users attempting to address important
educational issues. Congressional appropriations as well as
administration support for NAEP assume, and have a right to depend
upon, optimization of these annual expenditures. Perfection of an
instrument designed to yield specific reports to limited audiences
can hardly be justified in today's political and economic
environments. Thus, it seems reasonable for NAEP to pool resources with
other interested parties for mutually advantageous purposes.

For example, asking states to share the costs of exercise
development will both permit NAEP to do a better job and assure the
state that high quality exercises will be available on their
schedule at a fraction of what it would cost to develop them
independently. Charging a state or a large city a $5,000 consulting fee for
technical assistance might help it save $50,000 in expensive
failure, while permitting NAEP to maintain a valuable service. Setting
a reasonable fee to participate in a Large Scale Assessment
Conference challenges NAEP to prepare a worthwhile agenda and at the
same time discourages casual attendance.

One of the most important user groups is represented by the
over [...] states that currently have some form of assessment or
testing program. It is not envisioned here that NAEP would provide
services that are directly competitive with commercial or other
non-profit organizations. On the other hand, it is possible to
conceive an array of arrangements developed to accommodate states
or large systems with or without a third partner.

For example, three assessment "packages" could be developed
and made available to states to form part or all of their state
assessment program. The main features of these "packages" would
be that they

• provide a relationship to objectives and standards,

• permit comparison of state performance with NAEP national
results,

• represent real cost savings to state assessment programs by
providing already developed items of high quality and of
known performance,

• include local options for specialized objectives, and

• replace expensive state-wide programs with an economical,
high quality program, tailored to the state's needs and with
results that permit comparison to national data.

These packages would be designed so as to be incremental. For
example, as a first step, a state might contract with NAEP to
provide exercises on a regular schedule for certain specified
curriculum subjects. This would obviate the necessity for the state to
develop its own test development capability. A second step might
be for a state to contract for the complete test development
process. A final step might be for the state to ask NAEP to run its
complete state program simultaneously with the national data
collection effort and provide the state with results and analyses.

The size of the state population assessed and the complexity of
the program would impact costs, but in every case economies of
scale should operate in favor of this being a less expensive
alternative than a state managing a completely parallel effort. In
addition, it may be found that samples for the national assessment
and the state assessment can be drawn in such a way that they
complement each other, to the mutual benefit of both
assessments. In all of these versions, comparisons with national data
would be possible.

In the past, arrangements with NAEP have been difficult for
states because of postponements caused by NAEP budget changes
and such. What is suggested here are contractual arrangements
with states which are quite independent of NAEP assessment
budgets. As more states participate in this type of arrangement, the
total program would be strengthened. It would obviously have to
realize economies and greater quality for the states as well as
income and facilitation of data collection for the national effort.
The goal is a financially viable national assessment program,
which would mean more innovative exercise development
activity, more sophisticated data analyses, and more useful reports to
school districts, states, government agencies, and the public.



Progress Toward Standards As
Standards for Progress

The overall activities of NAEP skirt all sides of the issue of
educational standards without addressing the heart of the matter. Most
of the elements intrinsic to the setting and monitoring of
educational standards are already an integral part of NAEP. These
include the setting of learning objectives, the development of
measurement procedures specifically geared toward those objectives,
and the reporting of student performance levels in pursuit of
those objectives. What is missing is a pluralistic process for
taking the next step: for helping the various interested segments of
society make the value judgments needed to set their own
standards and to monitor and revise them over time. Descriptions of
objectives that are commonly agreed upon and of performance
levels that are currently being attained in different societal
subgroups go a long way toward informing the societal
standard-setting process.

Objectives and Standards 

An important feature of NAEP's procedures is that the learning
objectives guiding the assessment are determined by consensus as
to their relevance and importance. This step is more than half the
battle in standard setting because these objectives, in essence, are
operational statements of what is worth teaching and important
to learn. In effect, these objectives specify the areas in which it is
worthwhile having standards.

A cautionary note is required here, however, because the
concept of standards in a pluralistic society requires some provision
for local variation and self-determination. In contrast, the
principle of consensus might yield a common denominator that omits
important educational goals not shared by everyone. Although a
"national" assessment might reasonably be limited to common
goals, it would not truly be national for a pluralistic nation.

What is needed is a method for augmenting the present system
in order to obtain judgmental data descriptive of varying patterns
of educational priorities set by different societal subgroups across
the full range of objectives. Thus, by placing objective setting in
the context of pluralistic standards, some of the pressure toward
consensus would be relaxed. As a consequence, the total set of
objectives would include not only those for which substantial
consensus was achieved, but also those important objectives
primarily embraced by substantial subgroups. Although different
reporting profiles for different groups could be developed, it should
prove more useful for each group to appraise performance levels
on its own priority objectives in the context of the diverse
objectives of other groups as well as in the context of the common
objectives cutting across groups. Diversity of objectives is also the
best protection against the elevation of consensual objectives to
the level of implicit national standards.

For these reasons, it appears that objective setting should be
addressed in the arena of pluralistic standards. Accordingly, we
propose that the Exercise Development Committee of the
Assessment Policy Committee be broadened to one on Objectives and
Standards, with the charge not only to relate inwardly to the NAEP
exercise-development process but to relate outwardly to the
societal standard-setting process.



Performance Levels and Standards 

Another critical element in the standard-setting process is
information for each objective on the current performance levels and
trends in various societal subgroups. Inverting the customary
prescription that one must first determine the objectives of
instruction before developing measures of learning outcomes, Henry
Dyer (1967) once suggested that it might not be possible to decide
what the objectives ought to be until one knows what the current
outcomes are. The point is even more appropriate when applied
to standards. It might not be possible to decide what the
standards ought to be until one knows what current performance
levels are.

Detailed information on this point is available through NAEP,
but it would be even more valuable if it were provided in the
conjoint criterion-referenced and norm-referenced form made possible
by IRT scaling, as proposed in this NAEP redesign. As summarized
in Figure 5, IRT scaling permits one to estimate the proportion of
correct answers to each exercise expected for individuals in each
subgroup at any point on the skill scale. One can also estimate
the proportion of individuals in any group who have less than
some specified probability of success on any given exercise. This
type of detailed information, as aggregated in various ways,
provides the kind of group performance distributions needed to
inform the standard-setting process.
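
The two estimates just described can be illustrated with a minimal
sketch. The text above does not state the functional form of the
item curve, so the three-parameter logistic model, the assumption
that skill is normally distributed within a group, and every name
and parameter below (p_correct, theta_at, share_below, a, b, c) are
illustrative assumptions rather than the procedure of the redesign.

```python
import math
from statistics import NormalDist


def p_correct(theta, a, b, c=0.0):
    """Expected proportion correct on one exercise at skill level theta.

    Assumed model: three-parameter logistic curve with discrimination a,
    difficulty b, and lower asymptote (guessing level) c.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def theta_at(p0, a, b, c=0.0):
    """Skill level at which the success probability equals p0."""
    if not c < p0 < 1.0:
        raise ValueError("p0 must lie strictly between c and 1")
    return b + math.log((p0 - c) / (1.0 - p0)) / a


def share_below(p0, a, b, c, group_mean, group_sd):
    """Proportion of a group whose success probability is below p0.

    Assumes skill in the group is normally distributed with the given
    mean and standard deviation on the same scale as theta.
    """
    cutoff = theta_at(p0, a, b, c)
    return NormalDist(group_mean, group_sd).cdf(cutoff)
```

For a hypothetical exercise with a = 1.0, b = 0.0, and c = 0.0, and a
hypothetical group centered at 0.0 with unit spread,
share_below(0.5, 1.0, 0.0, 0.0, 0.0, 1.0) gives one half, since the
exercise's difficulty coincides with the group mean.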

Better still, the capacity to relate this group performance to
scales anchored by benchmark exercises provides concrete
exemplars for characterizing different performance levels. Eventually,
the development of behavioral anchors for these dimensions,
such as those exemplified by the Foreign Service Institute scale of
foreign language attainment, would enrich this characterization
with verbal summaries of related real-world capabilities
associated with each scale level. What is still needed to move on to
educational standards are the value judgments as to which
performance levels are deemed unsatisfactory, adequate, or excellent
by different societal groups.

Values and Standards 

Our intent in broaching the issue of educational standards is not
to involve NAEP directly in the standard-setting process, nor to
settle for its indirect involvement as a mere data resource on
consensual objectives and performance levels. As we have seen, NAEP
is already directly involved in one critical aspect of the standard
problem, namely, the choice via objective setting of those areas
that are worth teaching and learning and hence are worthy of
standards. Since in making such choices NAEP needs to be
sensitive to the pluralistic values of various societal groups, it seems
sensible that NAEP should be more actively involved with societal
groups on the issue of standards.

Again, the intent is not for NAEP to engage in the
standard-setting process, but to engage the public with NAEP results over
the issue of educational standards. NAEP data are, or could be,
highly pertinent for this purpose. And it puts NAEP in a position,
to use Bruner's (1966) words, of providing "the full range of
alternatives to challenge society to choice."

IV

Epilogue 



The last chapter of this report of a proposed NAEP redesign
focuses on ways to improve NAEP's flexibility for meeting varied
assessment needs, with particular stress on heightening NAEP
capabilities for



• addressing multiple policy questions,

• reaching multiple audiences in effective fashion,

• linking to other valuable data sources,

• enhancing and extending assessment services, and

• engaging the public around NAEP data on the important social
issue of educational standards.

Thus, our closing emphasis is on strategies to improve policy
impact, dissemination, knowledge utilization, user services, and
public involvement.

But we should not forget that the main reason this closing
emphasis is needed was covered in Chapter II. NAEP's perennial
difficulties in policy analysis, dissemination, service and knowledge
utilization, and public engagement stem directly from the design
problems addressed there. The original design led to performance
data that lacked direct comparability across exercises, age levels,
population subgroups, and time periods as well as to the results of
other assessment programs. This resulted in findings of debatable
meaning that were difficult to interpret, especially with respect
to time trends. It is not surprising that such data have had little
impact on American education.

The proposed redesign remedies these problems by means of BIB
spiralling and IRT scaling. This makes possible the formation of
meaningful scales whose construct validity, and hence
interpretability, can be appraised empirically. It also enables the
development of scales with common meaning across exercises, age
levels, subgroups, and time periods, thereby permitting powerful
comparisons with clear implications.

Furthermore, the proposed redesign (not only of data collection
and analysis procedures, but of reporting, dissemination, and
utilization procedures) is accomplished in ways that are

• protective of the links to past NAEP data,

• innovative in its move to new psychometric methodology, and

• aggressive in its outreach.